Preface



Computing systems are widely used today, and in many areas they play a key
role in accomplishing highly complex and safety-critical missions. At the
same time, the size and complexity of computing systems have continued to
increase, making their performance evaluation more difficult than ever before.

The purpose of this book is to provide comprehensive coverage of tools
and techniques for computing system reliability modeling and analysis.
Reliability analysis is a useful tool in evaluating the performance of complex
systems. Intensive studies have been carried out to improve the likelihood that
computing systems perform satisfactorily in operation.

Software and hardware are two major building blocks in computing systems. 
They have to work together successfully to complete many critical computing 
tasks. This book systematically studies the reliability of software, hardware and
integrated software/hardware systems. It also introduces typical models for the
reliability analysis of distributed/networked systems, and then further
develops some new models and analytical tools.

“Grid” computing has emerged as an important new field, distinguished from
conventional distributed computing by its focus on large-scale resource
sharing, innovative applications, and, in many cases, high-performance
orientation. This book also presents general reliability models for the grid
and discusses analytical tools for estimating grid reliability as it relates to
the resource management system, wide-area network communication, and
parallel running programs with multiple shared resources.

Furthermore, this book introduces the basic reliability theories and models
for various multi-state systems. Based on these models, some interesting decision
problems in system design and resource allocation are further discussed.

This book is organized as follows. 

Chapter 1 provides an introduction to the field of computing systems and 
reliability analysis. Simple reliability concepts are also discussed. Chapter 2 
provides the basic knowledge in reliability analysis and summarizes some 
common techniques for analyzing the computing system reliability. The 
fundamentals of Markov processes and Nonhomogeneous Poisson processes 
(NHPP) are also introduced, which are essential tools used in this book. 

Chapters 3 and 4 present important models for the reliability analysis of 
hardware and software systems, respectively. They are useful when hardware 
and software issues are dealt with separately at the system analysis stage. 
Chapter 5 discusses the models for integrated systems. This is essential in 
computing system analysis as both software and hardware systems have to work 
together. 

In Chapter 6, the reliability of various distributed computing systems 
which incorporate the network communication into the hardware/software 
reliability is studied. The distributed computing system is a common and 
widely-used networked system and hence a chapter is devoted to this. 

The reliability of grid computing systems, a new direction in computing
technology, is studied in Chapter 7. Since grid reliability is difficult to
evaluate due to the grid's wide-area, heterogeneous and time-varying
characteristics, we first construct reliability models for the different parts
of the grid, including the resource management system, large-scale network,
distributed software and resources.

Finally, Chapter 8 studies multi-state system reliability. Some
optimization models in system design and resource allocation are presented
in Chapter 9. This is an area where research is ongoing and further development
is needed.

The basic chapters in this book are Chapters 3-7. Readers familiar with
basic reliability can start from Chapter 3 directly. Chapters 8 and 9 cover
advanced topics and can be read by those interested in those specific topics.

Many models and results found in the literature and from our research are
presented in the book. We hope that practitioners will also find these
approaches easy to implement. In addition, many examples accompany these
approaches.

The book serves as a reference book for students, professors, engineers and
researchers in related science and engineering fields. It can be used for graduate
and senior undergraduate courses. Researchers and students should find many
ideas useful in their academic work.

Readers should have some basic knowledge of probability and calculus.
However, difficult details are omitted for the benefit of the general audience.
References are given so that those interested in more specific results can
find further details.



M. Xie 
Y. S. Dai 
K. L. Poh 
CHAPTER 1

INTRODUCTION



1.1. Need for Computing System Reliability Analysis

Computing has been the fastest developing technology during the last century.
Computing systems are widely used in many areas, and they are expected to
accomplish various complex and safety-critical missions. Applications of
computing systems now span many different fields and can be found in many
products, for example, air traffic control systems, nuclear power plants,
aircraft, real-time military systems, telephone switching, bank auto-payment,
hospital patient monitoring systems, and so forth.

The size and complexity of computing systems have increased from a
single processor to multiple distributed processors, from individual, separate
systems to networked, integrated systems, from small-scale program running to
large-scale resource sharing, and from local-area computation to global-area
collaboration. A computing system today may contain many processors and
communication channels, and it may span a wide area across the world. Such
systems combine software and hardware that have to function together to
complete various tasks. They may incorporate multiple states, and their failures
may be correlated with one another. These factors make system modeling and
analysis complicated. As a result, making decisions in system design or
resource allocation also becomes difficult.

There is no common approach for assessing computing systems. Reliability is a
quantitative measure useful in this context, as reliability can be broadly
interpreted as the ability of a system to perform its intended function. Intensive
studies on reliability models and analytical tools are carried out to improve the
chance that computing systems will perform satisfactorily in operation. As
the functionality of computing operations becomes more essential, there is a
greater need for high reliability of computing systems.

In fact, in order to increase the performance of the computing systems and 
to improve the development process, a thorough analysis of their reliability is 
needed. Based on the models and analysis, approaches to improve system 
reliability can be further implemented. 



1.2. Computing System Reliability Concepts 

In general, the basic reliability concept is defined as the probability that a 
system will perform its intended function during a period of running time 
without any failure (Musa, 1998). A failure causes the system performance to 
deviate from the specified performance. 

A fault is an erroneous state of the system. Although the definition of a fault
differs across systems and situations, a fault always resides in the system and
can be removed by correcting the erroneous part of the system. For computing
systems, the basic reliability concept can be adapted to specific forms such as
“software reliability”, “system reliability”, “service reliability”, “system
availability”, etc., for different purposes.

Most computing systems contain software programs to achieve various
computing tasks. Software reliability is an important metric for assessing
software performance. Similar to the general reliability concept, software
reliability is defined as the probability that the software will function
without failure under a given environmental condition during a specified period
of time (Xie, 1991). Here, a software failure generally means the inability to
perform an intended task specified by the requirements.

Software reliability measures only a software program. To assess a computing
system that may contain multiple software programs and hardware components,
system reliability is commonly used. It is defined as the probability that all
the tasks for which the system is intended can be successfully completed
(Kumar et al., 1986). The software programs may be arranged in parallel or in
series, or they may have an arbitrary distributed structure. The system
reliability must be computed differently according to the system structure.

Some computing systems are developed to provide different services for 
the users. The users may only be concerned with whether the service they are 
using is reliable or not. From the users' point of view, service reliability is an 
important measure, and it is defined as the probability for a given service to be 
achieved successfully. This is a useful concept in service quality analysis, and it 
broadens the traditional reliability definition. 



1.3. Approaches to Computing System Modeling 

Computing system reliability is an interesting, but difficult, research area.
Although many reliability models have been suggested and studied in the
literature, none can be used universally, and no single model performs well
in all situations. The reason is that the assumptions made for each model are
correct, or are good approximations of reality, only in specific cases.

In computing systems, hardware (such as computers, routers,
processors, CPUs, memories, disks, etc.) provides the fundamental
configuration to support computing tasks. Many traditional reliability models
mainly deal with hardware reliability, such as Barlow & Proschan (1981),
Elsayed (1996) and Blischke & Murthy (2000).

Software is another important element of computing systems besides
hardware. Unlike hardware, software does not wear out and it can be easily
reproduced. Furthermore, software systems are usually debugged during the
testing phase, so their reliability improves over time. Many software
reliability models have been proposed; see e.g. Xie (1991), Lyu (1996),
Musa (1998) and Pham (2000).

However, a computing system usually includes not only a hardware
subsystem but also a software subsystem, and the two ought not to be studied
separately. Both software and hardware failures should be integrated in
analyzing the performance of the whole system. Many reliability models for
integrated software and hardware systems have been presented, such
as Goel & Soenjoto (1981), Siegrist (1988), Laprie & Kanoun (1992), Dugan &
Lyu (1994), Welke et al. (1995) and Lai et al. (2002). Although some
books contain discussion of integrated software and hardware system
reliability, this book is entirely devoted to this topic and the associated issues.

With the development of network techniques, many computing
systems need to communicate information through (local or global)
networks. The programs and resources of such systems are distributed across
different sites connected by the networks. This kind of computing system is
usually called a distributed computing system. The performance of a distributed
computing system is determined not only by the software/hardware reliability
but also by the reliability of the communication networks. Many models
and algorithms have been presented for distributed system reliability; see
e.g. Hariri et al. (1985), Kumar et al. (1986), Chen & Huang (1992), Chen et
al. (1997), Lin et al. (1999, 2001) and Dai et al. (2003a).

As a special type of distributed computing system, grid computing is a
recently developed technique distinguished by its focus on various shared
resources, large-scale networks, wide-area communications, real-time programs,
diverse virtual organizations, heterogeneous platforms, etc. Many experts
believe that grid computing systems and technologies will offer a second chance
to fulfill the promises of the Internet; see e.g. Foster & Kesselman (1998).
Although it is difficult to study because of this complexity, the reliability
of grid computing systems is beginning to attract attention.

Most reliability models for computing systems assume only two
possible states of the system. In reality, many computing systems may have
more than two states (Lisnianski & Levitin, 2003), especially real-time
systems. For example, if some computing elements in a real-time system fail,
the system may continue working but with degraded performance. Such a degraded
state lies between the perfectly working and completely failed states. To study
these types of systems, multi-state system reliability has recently attracted
the attention of many researchers, e.g. Brunelle & Kapur (1999), Pourret et
al. (1999), Levitin et al. (2003) and Wu & Chan (2003).

The book provides a systematic and comprehensive study of different
reliability models and analytical tools for various computing systems, including
hardware, software, integrated software/hardware, distributed computing, grid
computing, multi-state systems, etc. Some interesting optimization problems for
system design and resource allocation are further discussed. Many examples
are used to illustrate the use of these models.



CHAPTER 2

BASIC RELIABILITY CONCEPTS AND ANALYSIS



Reliability concepts and analytical techniques are the foundation of this book. 
Many books dealing with general and specific issues of reliability are available, 
see e.g., Barlow & Proschan (1981), Shooman (1990), Hoyland & Rausand 
(1994), Elsayed (1996), and Blischke & Murthy (2000). Some basic and 
important reliability measures are introduced in this chapter. Since computing 
system reliability is related to general system reliability, the focus will be on tools 
and techniques for system reliability modeling and analysis. Since Markov 
models will be extensively used in this book, this chapter also introduces the 
fundamentals of Markov modeling. Moreover, Nonhomogeneous Poisson Process 
(NHPP) is widely used in reliability analysis, especially for repairable systems. 
Its general theory is also introduced for the reference. 



2.1. Reliability Measures

Reliability is the analysis of failures, their causes and consequences. It is the
most important characteristic of product quality, as things have to work
satisfactorily before other quality attributes are considered. Usually, specific
performance measures can be embedded into reliability analysis through the fact
that if the performance falls below a certain level, a failure can be said to
have occurred.



2.1.1. Definition of reliability 

The commonly used definition of reliability is the following. 

Definition 2.1. Reliability is the probability that the system will perform its 
intended function under specified working condition for a specified period of 
time. 

Mathematically, the reliability function R(t) is the probability that a system will
operate successfully without failure in the interval from time 0 to time t,

$$R(t) = P(T > t), \quad t \ge 0 \qquad (2.1)$$

where T is a random variable representing the failure time or time-to-failure.

The failure probability, or unreliability, is then

$$F(t) = 1 - R(t) = P(T \le t)$$

which is known as the distribution function of T.

If the time-to-failure random variable T has a density function f(t), then

$$R(t) = \int_t^{\infty} f(x)\,dx$$

The density function can be mathematically described as

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t)}{\Delta t}$$

so that $f(t)\,\Delta t$ can be interpreted as the approximate probability that the
failure time T falls between time t and the next interval of operation,
$t + \Delta t$. The three functions, R(t), F(t) and f(t), are closely related to
one another. If any of them is known, all the others can be determined.
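The relationship among the three functions can be checked numerically. The following is a minimal sketch, assuming an exponential time-to-failure; the function names, the rate value and the finite-difference step are illustrative assumptions, not part of the text.

```python
import math

def reliability(t, rate):
    """R(t) = P(T > t), assuming an exponential time-to-failure."""
    return math.exp(-rate * t)

def unreliability(t, rate):
    """F(t) = 1 - R(t), the distribution function of T."""
    return 1.0 - reliability(t, rate)

def density(t, rate, dt=1e-6):
    """f(t) approximated by the finite difference [F(t + dt) - F(t)] / dt."""
    return (unreliability(t + dt, rate) - unreliability(t, rate)) / dt

rate = 0.5  # assumed failure rate (per hour), for illustration only
t = 2.0
print("R(t) =", reliability(t, rate))
print("F(t) =", unreliability(t, rate))
print("f(t) ~", density(t, rate))
```

Here only R(t) is specified directly; F(t) and f(t) are both derived from it, which is the sense in which knowing one function determines the others.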

2.1.2. Mean time to failure (MTTF)

Usually we are interested in the expected time to next failure, and this is termed 
mean time to failure. 

Definition 2.2. The mean time to failure (MTTF) is defined as the expected value 
of the lifetime before a failure occurs. 

Suppose that the reliability function for a system is given by R(t). The MTTF
can then be computed as

$$\mathrm{MTTF} = \int_0^{\infty} t\, f(t)\,dt = \int_0^{\infty} R(t)\,dt \qquad (2.2)$$



Example 2.1. If the lifetime distribution function follows an exponential
distribution with parameter $\lambda$, that is, $F(t) = 1 - \exp(-\lambda t)$, the MTTF is

$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \int_0^{\infty} \exp(-\lambda t)\,dt = \frac{1}{\lambda} \qquad (2.3)$$

This is an important result: for the exponential distribution, the MTTF is
determined by a single model parameter. Hence, if the MTTF is known, the
distribution is fully specified.
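Equation (2.2) also suggests a numerical route when no closed form is available. The sketch below approximates the integral of R(t) with the trapezoidal rule and compares it with the closed-form result of Example 2.1; the function name, the truncation point and the value of the rate are assumptions made for illustration.

```python
import math

def mttf_from_reliability(reliability, upper, steps=100_000):
    """Approximate the integral of R(t) over [0, upper] (trapezoidal rule).

    `upper` must be large enough that R(upper) is negligible.
    """
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h

lam = 0.01  # assumed failure rate (failures per hour)
approx = mttf_from_reliability(lambda t: math.exp(-lam * t), upper=50.0 / lam)
print("numerical MTTF :", approx)
print("closed form 1/lambda:", 1.0 / lam)
```

The same helper could be applied to any reliability function that decays fast enough, which is why it takes R(t) as an argument rather than hard-coding the exponential case.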



2.1.3. Failure rate function 

The failure rate function, or hazard function, is very important in reliability 
analysis because it specifies the rate of the system aging. The definition of failure 
rate function is given here. 



Definition 2.3. The failure rate function $\lambda(t)$ is defined as

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t\, R(t)} = \frac{f(t)}{R(t)} \qquad (2.4)$$



The quantity $\lambda(t)\,dt$ represents the probability that a device of age t will
fail in the small interval from time t to t + dt. The importance of the failure
rate function is that it indicates how the aging behavior changes over the life
of a population of components. For example, two designs may provide the same
reliability at a specific point in time, but their failure rate curves can be
very different.



Example 2.2. If the failure distribution function follows an exponential
distribution with parameter $\lambda$, then the failure rate function is

$$\lambda(t) = \frac{f(t)}{R(t)} = \frac{\lambda \exp(-\lambda t)}{\exp(-\lambda t)} = \lambda \qquad (2.5)$$

This means that the failure rate function of the exponential distribution is a
constant. In this case, the system does not have any aging property. This
assumption is usually valid for software systems. However, for hardware
systems, the failure rate could have other shapes.
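To make the contrast between a constant and a non-constant failure rate concrete, the sketch below evaluates $\lambda(t) = f(t)/R(t)$ for the exponential case of Example 2.2 and for a Weibull lifetime. The Weibull distribution and all parameter values are assumptions introduced here purely as an illustration of "other shapes"; they do not come from the text.

```python
import math

def hazard(density, reliability, t):
    """Failure rate from (2.4): lambda(t) = f(t) / R(t)."""
    return density(t) / reliability(t)

LAM = 0.2  # assumed exponential parameter

def exp_reliability(t):
    return math.exp(-LAM * t)

def exp_density(t):
    return LAM * math.exp(-LAM * t)

SHAPE, SCALE = 2.0, 10.0  # assumed Weibull parameters (shape > 1 means aging)

def weibull_reliability(t):
    return math.exp(-((t / SCALE) ** SHAPE))

def weibull_density(t):
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * weibull_reliability(t)

for t in (1.0, 5.0, 10.0):
    print(t,
          hazard(exp_density, exp_reliability, t),
          hazard(weibull_density, weibull_reliability, t))
```

The exponential column stays at the assumed parameter value for every t, while the Weibull column grows with t, which is one way an aging component would show up in the failure rate curve.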



2.1.4. Maintainability and availability 

When a system fails to perform satisfactorily, repair is normally carried out to
locate and correct the fault. The system is restored to operational effectiveness by
making an adjustment or by replacing a component.

Definition 2.4. Maintainability is defined as the probability that a failed system
will be restored to a functioning state within a given period of time when
maintenance is performed according to prescribed procedures and resources.



Generally, maintainability is the probability of isolating and repairing a fault in a
system within a given time. Maintenance personnel have to work with system
designers to ensure that the system can be maintained cost-effectively.

Let T denote the time to repair or the total downtime. If the repair time T has
a density function g(t), then the maintainability, V(t), is defined as the
probability that the failed system will be back in service by time t, i.e.,

$$V(t) = P(T \le t) = \int_0^{t} g(x)\,dx$$

An important measure often used in maintenance studies is the mean time to 
repair (MTTR) or the mean downtime. MTTR is the expected value of the repair 
time. 

Another important reliability related concept is system availability. This is a 
measure that takes both reliability and maintainability into account. 



Definition 2.5. The availability function of a system, denoted by A(t), is
defined as the probability that the system is available at time t.



Unlike reliability, which focuses on a period of time during which the system is
free of failures, availability concerns a time point at which the system is not
in the failed state. Mathematically,

A(t) = Pr(System is up or available at time instant t)

The availability function, which is a complex function of time, has a simple
steady-state or asymptotic expression. In fact, we are usually concerned mainly
with systems running for a long time. The steady-state or asymptotic availability
is given by

$$A = \lim_{t \to \infty} A(t) = \frac{\text{System up time}}{\text{System up time} + \text{System down time}} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$



The mean time between failures (MTBF) is another important measure for
repairable systems. It presumes that the system has failed and has been repaired.
Like MTTF and MTTR, MTBF is the expected value of a random variable, the time
between failures. Mathematically, MTBF = MTTR + MTTF.



Example 2.3. If a system has a lifetime distribution function
$F(t) = 1 - \exp(-\lambda t)$ and a maintainability function
$V(t) = 1 - \exp(-\mu t)$, then MTTF = $1/\lambda$ and MTTR = $1/\mu$. The MTBF
is the sum of MTTF and MTTR, and the steady-state availability is

$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTR} + \mathrm{MTTF}} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{\mu}{\lambda + \mu}$$
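The quantities in Example 2.3 are easy to compute directly. The sketch below is a small illustration under assumed failure and repair rates; the function name and the numerical values are hypothetical.

```python
def steady_state_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

lam = 0.001  # assumed failure rate (per hour)
mu = 0.1     # assumed repair rate (per hour)

mttf = 1.0 / lam   # mean time to failure for the exponential lifetime
mttr = 1.0 / mu    # mean time to repair for the exponential repair time
mtbf = mttf + mttr

availability = steady_state_availability(mttf, mttr)
print("MTTF =", mttf, "MTTR =", mttr, "MTBF =", mtbf)
print("A =", availability, "mu/(lambda+mu) =", mu / (lam + mu))
```

Both printed availability values should agree, since the second is just the algebraic simplification shown in the example.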



2.2. Common Techniques in Reliability Analysis 

There are many techniques in reliability analysis. The most widely used
techniques for computing systems are reliability block diagrams, network
diagrams, fault tree analysis and Monte Carlo simulation, which are
introduced in the following sections. Another popular and important analytical
tool, the Markov model, is introduced in Section 2.3 since it is the main
technique used in this book.



2.2.1. Reliability block diagram

A reliability block diagram is one of the conventional and most common tools of 
system reliability analysis. A major advantage of using the reliability block 
diagram approach is the ease of reliability expression and evaluation. 

A reliability block diagram shows the system reliability structure. It is made
up of individual blocks, and each block corresponds to a system module or
function. The blocks are connected with each other through certain basic
relationships, such as series and parallel connections. The series relationship
between two blocks is depicted in Fig. 2.1(a) and the parallel relationship in
Fig. 2.1(b).




(a) Series blocks (b) Parallel blocks 

Fig. 2.1. Basic relationships between two blocks. 



Suppose that the reliability of the block for module i is known or estimated, and
denote it by $R_i$. Assuming that the blocks are independent from a reliability
point of view, the reliability of a system with two serially connected blocks is

$$R_s = R_1 \cdot R_2 \qquad (2.6)$$

and that of a system with two parallel blocks is

$$R_p = 1 - \prod_{i=1}^{2} (1 - R_i) \qquad (2.7)$$

The blocks in either a series or a parallel structure can be merged into a new
block with the reliability expression given by the above equations. Using such
combinations, any parallel-series system can eventually be merged into one
block, and its reliability can be easily computed by repeatedly using those
equations.

Example 2.4. A parallel-series system consists of five modules whose reliability
block diagram is shown in Fig. 2.2(a). The parallel blocks can be merged as
shown in Fig. 2.2(b). The result can be further merged into one block simply
through the series expression (2.6). The combined reliability expression is given
under the new blocks.




(a) Five parallel-series connected modules

(b) The merged blocks, with reliabilities $1 - \prod_{i=1}^{2} (1 - R_i)$, $R_3$
and $1 - \prod_{i=4}^{5} (1 - R_i)$

Fig. 2.2. Reliability computation of a parallel-series system.
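The merging procedure of Example 2.4 can be sketched in a few lines. The structure assumed here (modules 1 and 2 in parallel, module 3 in series, modules 4 and 5 in parallel) is the one described by the fault tree in Example 2.6; the module reliabilities are hypothetical values chosen only for illustration.

```python
import math

def series(*reliabilities):
    """Series blocks, generalizing (2.6): product of the block reliabilities."""
    return math.prod(reliabilities)

def parallel(*reliabilities):
    """Parallel blocks, generalizing (2.7): 1 - product of (1 - R_i)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Hypothetical module reliabilities R_1 .. R_5
r = {1: 0.9, 2: 0.9, 3: 0.95, 4: 0.8, 5: 0.8}

# Step 1: merge each parallel pair into one block.
block_12 = parallel(r[1], r[2])
block_45 = parallel(r[4], r[5])

# Step 2: merge the three remaining blocks in series.
system = series(block_12, r[3], block_45)
print("merged blocks:", block_12, r[3], block_45)
print("system reliability:", system)
```

Because `series` and `parallel` accept any number of arguments and return a single reliability, they can be nested to evaluate any parallel-series structure by repeated merging.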



Furthermore, a library of reliability block diagrams can be constructed to
include other configurations or relationships. Additional notation is needed,
and specific formulas for evaluating these blocks must be obtained and added to
the library. One such example is the simple k-out-of-n configuration described
below.

Example 2.5. A k-out-of-n system requires that at least k modules out of a total
of n be operational for the system to work. Usually a voter is needed; see
Fig. 2.3.




Fig. 2.3. The k-out-of-n configuration represented by blocks.



If the voter is perfect and all the modules have reliability R, the formula for
evaluating the reliability of these blocks, which can be obtained via conditioning
or the binomial distribution (Barlow & Proschan, 1981), is

$$\sum_{i=k}^{n} \frac{n!}{i!\,(n-i)!}\, R^{i} (1-R)^{n-i} \qquad (2.8)$$

A majority voting system requires more than half of the modules to be operational.
The reliability of such a system is given by

$$\sum_{i=\lfloor n/2 \rfloor + 1}^{n} \frac{n!}{i!\,(n-i)!}\, R^{i} (1-R)^{n-i}$$

where $\lfloor X \rfloor$ denotes the largest integer that is less than or equal to X.
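The binomial sum for a k-out-of-n block with a perfect voter translates directly into code. This is a minimal sketch; the function names and the numerical values in the usage lines are illustrative assumptions.

```python
from math import comb

def k_out_of_n(k, n, r):
    """Reliability of a k-out-of-n block with identical, independent modules.

    Sums the binomial probabilities of exactly i working modules for
    i = k, ..., n, assuming a perfect voter.
    """
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))

def majority_voting(n, r):
    """More than half of the n modules must work: k = floor(n/2) + 1."""
    return k_out_of_n(n // 2 + 1, n, r)

r = 0.9  # assumed module reliability
print("2-out-of-3:", k_out_of_n(2, 3, r))
print("majority of 5:", majority_voting(5, r))
```

Two limiting cases give a quick sanity check: k = n reduces to a series system (R to the power n), and k = 1 reduces to a parallel system.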

2.2.2. Network diagram

Network diagrams are commonly used in representing communication networks 
consisting of individual links. Most network applications are in the 
communication domain. The computation of network reliability is the primary 
application of network diagrams. 

The purpose of a network is to execute programs by connecting different 
sites that contain processing elements and resources. For simple network 
diagrams, computation is not complex and reliability block diagrams can 
alternatively be used. For example, Fig. 2.4 shows the network diagrams that are 
connected through series or parallel links. 

(a) Series connected links (Path A, Path B)

(b) Parallel connected links (Path A, Path B)

Fig. 2.4. Network diagrams representing two links in series and in parallel.





Fig. 2.4 can alternatively be represented by the reliability block diagrams of
Fig. 2.1 if we view each link as a block.

The choice between a reliability block diagram and a network diagram depends on
which is more convenient to use and describe for the specific problem. Usually,
the reliability block diagram is used for a modular system that consists of
many independent modules, each of which can be easily represented by a
reliability block. The network diagram is often used for networked systems where
processing nodes are connected and communicate through links, such as
distributed computing systems, local/wide area networks and wireless
communication channels.

2.2.3. Fault tree analysis

Fault tree analysis is a common tool in system safety analysis, and it has been
adapted to a range of reliability applications.

A fault tree diagram is the underlying graphical model in fault tree analysis. 
Whereas the reliability block diagram is mission success oriented, the fault tree 
shows which combinations of the component failures will result in a system 
failure. The fault tree diagram represents the logical relationships of ‘AND’ and 
‘OR’ among diverse failure events. Various shapes represent different meanings. 
In general, four basic shapes corresponding to four relationships are depicted by 
Fig. 2.5. 




Input event ‘and’ gate ‘or’ gate Output/Top event 

Fig. 2.5. Basic shapes of fault tree diagram. 

Since any logical relationships can be transformed into the combinations of 
‘AND’ and ‘OR’ relationships, the status of output/top event can be derived by 
the status of input events and the connections of the logical gates. 



Example 2.6. A fault tree diagram corresponding to the reliability block diagram
in Example 2.4 is shown in Fig. 2.6. As the fault tree shows, the top event
(system failure) occurs if both modules 1 and 2 fail, or module 3 fails, or
both modules 4 and 5 fail.

Fig. 2.6. Fault tree for five modules.
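The logic of Example 2.6 can be written as a Boolean expression and evaluated over all combinations of module states. The sketch below assumes that module failures are independent and uses hypothetical module reliabilities; exhaustive enumeration is chosen for clarity and is only practical for small trees.

```python
from itertools import product

def top_event(failed):
    """Top event of Example 2.6; failed[i] is True when module i has failed."""
    return (failed[1] and failed[2]) or failed[3] or (failed[4] and failed[5])

def top_event_probability(reliability):
    """Sum the probabilities of all failure combinations that trigger the top event.

    Assumes independent module failures.
    """
    modules = sorted(reliability)
    total = 0.0
    for states in product([False, True], repeat=len(modules)):
        failed = dict(zip(modules, states))
        p = 1.0
        for m in modules:
            p *= (1.0 - reliability[m]) if failed[m] else reliability[m]
        if top_event(failed):
            total += p
    return total

# Hypothetical module reliabilities R_1 .. R_5
rel = {1: 0.9, 2: 0.9, 3: 0.95, 4: 0.8, 5: 0.8}
failure_probability = top_event_probability(rel)
print("P(top event) =", failure_probability)
print("system reliability =", 1.0 - failure_probability)
```

Since the fault tree is failure oriented and the reliability block diagram is success oriented, one minus the top-event probability should match the reliability obtained by merging the blocks of the corresponding diagram.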



A fault tree diagram can describe the fault propagation in a system. However, 
complex systems may exhibit much more complex failure behavior, including 
multiple failure modes and dependent failure modes. These failures will have 
different effects on the mission outcome. The basic fault tree analysis does not 
support this type of modeling. Moreover, repair and maintenance are two 
important operations in system analysis that cannot be expressed easily using a 
fault tree formulation. 



2.2.4. Monte Carlo simulation 

In a Monte Carlo simulation, a reliability model is evaluated repeatedly using
parameter values drawn from a specific distribution. Monte Carlo simulation
is often used to evaluate the MTBF of complex systems. The following
steps apply:

1) Simulate random numbers for each random variable needed in the
simulation model.

2) Evaluate the desired function.

3) Repeat steps 1 and 2 a total of n times to obtain n samples of the desired
function. For example, the system failure times will be T(1), T(2), ..., T(n).

4) Estimate the desired parameter. For example, the expected value of the
system failure time can be obtained from

$$E(T) = \mathrm{MTBF} = \frac{1}{n} \sum_{i=1}^{n} T(i)$$

5) Obtain an estimate of the precision of the estimate, such as the sample
standard deviation of the estimated value.

Monte Carlo simulation can handle a variety of complex system 
configurations and failure rate models. However, Monte Carlo simulation usually 
requires the development of a customized program, unless the system 
configuration fits a standard model. It also requires lengthy computer runs if 
accurate and converging computations are desired. 
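The five steps can be illustrated with a small simulation. This sketch assumes the five-module structure of Example 2.4 (modules 1 and 2 in parallel, module 3 in series, modules 4 and 5 in parallel), independent exponential module lifetimes with a common rate, and no repair; the rate, sample size and seed are arbitrary choices for illustration.

```python
import random
import statistics

def system_failure_time(rng, rate):
    """Steps 1 and 2: draw five module lifetimes and evaluate the system lifetime.

    A parallel pair fails when its last module fails (max); the series
    arrangement fails when its first block fails (min).
    """
    t = [rng.expovariate(rate) for _ in range(5)]
    return min(max(t[0], t[1]), t[2], max(t[3], t[4]))

def monte_carlo_mean(n, rate, seed=1):
    """Steps 3 to 5: repeat n times, then estimate the mean and its precision."""
    rng = random.Random(seed)
    samples = [system_failure_time(rng, rate) for _ in range(n)]
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / n ** 0.5  # standard error of the mean
    return mean, std_err

mean, std_err = monte_carlo_mean(n=20_000, rate=0.01)
print(f"estimated mean failure time: {mean:.2f} +/- {std_err:.2f}")
```

For this particular structure with a common rate $\lambda$, integrating the system reliability gives the exact mean $8/(15\lambda)$, which offers a check on the simulation; for structures without such a closed form, the standard error from step 5 is the main guide to how many runs are needed.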



2.3. Markov Process Fundamentals 

The Markov model is another widely used technique in reliability analysis. It
overcomes most disadvantages of the other techniques and is more flexible to
implement in reliability analysis for various computing systems; it will be
applied in the later chapters.



2.3.1. Stochastic processes

When we examine the evolution of a process governed by the rules of 
probability, we observe a stochastic process. The study of stochastic processes 
involves the analysis of a collection of random variables, their interdependence, 
their change over time, and limiting behavior, among others (Ross, 2000). 

In the study of stochastic processes, it is useful to establish two distinct 
categories: 

1) Stationary: A stationary process is one for which the distribution remains 
the same over time. 

2) Evolutionary ( Nonstationary ): An evolutionary process can be defined as 
one that is not stationary and the process evolves with time. 

Almost all systems are dynamic in nature. The Markov model is a powerful tool
for solving such dynamic problems. Its stochastic process is a sequence of
outcomes $X_t$, where t takes values from a parameter space T.

If the parameter space T is discrete (finite or countably infinite), the sequence
is called a discrete-time process and is denoted by $\{X_n\}$, where
n = 1, 2, .... The index n identifies the steps of the process. On the other hand,
if the parameter space T is continuous or uncountable, the sequence is called a
continuous-time process and is denoted by $\{X_t,\ t \in [0, \infty)\}$.

The set of all possible and distinct outcomes of all experiments in a
stochastic process is called its state space and is normally denoted by
$\Omega$. Its elements are called the states. If the state space $\Omega$ is
discrete, the process is called a discrete-state process. Otherwise, it is
called a continuous-state process.

2.3.2. Standard Markov models

There are four types of standard Markov models, corresponding to four types of
Markov processes classified according to their state-space and time
characteristics, as shown in Table 2.1.



Table 2.1. Four types of Markov processes.

Type    State Space    Time Space
1       Discrete       Discrete
2       Discrete       Continuous
3       Continuous     Discrete
4       Continuous     Continuous



The standard Markov models satisfy the Markov property, which is defined here. 



Definition 2.6. For a stochastic process that possesses Markov property, the 
probability of any particular future behavior of the process, when its current state 
is known exactly, is not changed by additional information concerning its past 
behavior. 



These four Markov models are described in more detail in the following sections.



Discrete-Time Markov chain 

A discrete-state process is referred to as a chain, so the discrete-state, discrete-time Markov process is usually called a discrete-time Markov chain (DTMC).

A general discrete-time chain is a sequence of discrete random variables $\{X_n, n = 0, 1, 2, \ldots\}$, in which $X_{n+1}$ is dependent on all previous outcomes $X_0, X_1, \ldots, X_n$. The analysis of this type of chain can easily become unmanageable, especially for long-term evaluation. Fortunately, in many practical situations, the influence of earlier outcomes on future ones tends to diminish rapidly with time.

For mathematical tractability, we can assume that $X_{n+1}$ is dependent only on the $i$ previous outcomes, where $i \ge 1$ is a fixed and finite number. In this case, deriving $\Pr\{X_{n+1} = j\}$ requires only the information about the previous $i$ outcomes (from step $n-i+1$ to step $n$), i.e.,

$$\Pr\{X_{n+1} = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n\} = \Pr\{X_{n+1} = j \mid X_{n-i+1} = i_{n-i+1}, X_{n-i+2} = i_{n-i+2}, \ldots, X_n = i_n\} \tag{2.9}$$

We call this type of chain a Markov chain of order $i$.

We usually refer to the first-order Markov chain simply as a Markov chain. For these chains, only their present (at time $n$) has any influence on their future (at time $n+1$). In other words, for all $n \ge 0$,

$$\Pr\{X_{n+1} = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_n = i\} = \Pr\{X_{n+1} = j \mid X_n = i\} \tag{2.10}$$

The essential characteristic of such a Markov process can be thought of as 
memoryless. 

For the right-hand side of the above equation, it is assumed that the state space $\Omega$ under consideration is either finite or countably infinite. Define

$$p_{ij}(n, n+1) = \Pr\{X_{n+1} = j \mid X_n = i\}, \quad n = 0, 1, \ldots \tag{2.11}$$

The conditional probability $p_{ij}(n, n+1)$ is called the (one-step) transition probability from state $i$ to state $j$ at time $n$. The $m$-step transition probabilities at time $n$ are defined by

$$p_{ij}(n, n+m) = \Pr\{X_{n+m} = j \mid X_n = i\}, \quad n = 0, 1, \ldots \tag{2.12}$$

and the corresponding $m$-step transition matrix at time $n$ is $\mathbf{P}(n, n+m)$. The transition matrix should satisfy

$$\mathbf{P}(m, n) = \mathbf{P}(m, l)\,\mathbf{P}(l, n), \quad m \le l \le n \tag{2.13}$$

or, equivalently,

$$p_{ij}(m, n) = \sum_{k} p_{ik}(m, l)\, p_{kj}(l, n), \quad m \le l \le n \tag{2.14}$$

This equation is known as the Chapman-Kolmogorov equation (Ross, 2000). 

Example 2.7. Suppose that a computing system has three states after each run. The states are perfect, degraded, and failed, denoted by states 1, 2 and 3. The state of the current run affects only the state of the next run. The matrix of one-step transition probabilities is

$$\mathbf{P} = \begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.5 & 0.2 \\ 0.1 & 0.3 & 0.6 \end{bmatrix}$$

This is a discrete-time, discrete-state Markov chain (DTMC) that is depicted by the transition graph in Fig. 2.7.

Fig. 2.7. DTMC for the three-state system transitions.

According to the Chapman-Kolmogorov equation, the two-step transition matrix can be obtained as

$$\mathbf{P}(0,2) = \mathbf{P}(0,1)\,\mathbf{P}(1,2) = \begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.5 & 0.2 \\ 0.1 & 0.3 & 0.6 \end{bmatrix}^2 = \begin{bmatrix} 0.56 & 0.27 & 0.17 \\ 0.38 & 0.37 & 0.25 \\ 0.22 & 0.35 & 0.43 \end{bmatrix}$$

Thereafter, if the system is initially in the perfect state, the probability that it is in that state after 2 runs is $p_{11}(0,2) = 0.56$. The four-step transition matrix is

$$\mathbf{P}(0,4) = \mathbf{P}(0,2)\,\mathbf{P}(2,4) = \begin{bmatrix} 0.56 & 0.27 & 0.17 \\ 0.38 & 0.37 & 0.25 \\ 0.22 & 0.35 & 0.43 \end{bmatrix}^2 = \begin{bmatrix} \cdot & \cdot & 0.236 \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{bmatrix}$$

where only the entry $p_{13}(0,4)$ needed below is shown. The probability that the system, starting from the perfect state, is not in the failed state after 4 runs is

$$1 - p_{13}(0,4) = 1 - 0.236 = 0.764$$
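The matrix products of Example 2.7 can be checked numerically. The following is a minimal sketch in plain Python; the helper names `mat_mul` and `mat_pow` are illustrative and not part of the original text. It assumes a homogeneous chain, so that the m-step matrix is simply the m-th power of the one-step matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, m):
    """m-step transition matrix of a homogeneous DTMC (P multiplied m times)."""
    n = len(p)
    # Start from the identity matrix, i.e. the 0-step transition matrix.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, p)
    return result


# One-step transition matrix of Example 2.7
# (index 0 = perfect, 1 = degraded, 2 = failed).
P = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.1, 0.3, 0.6]]

P2 = mat_pow(P, 2)   # two-step matrix P(0,2)
P4 = mat_pow(P, 4)   # four-step matrix P(0,4)

p11_two_steps = P2[0][0]                  # perfect -> perfect in 2 runs
p_not_failed_four_steps = 1.0 - P4[0][2]  # perfect -> not failed after 4 runs

print(p11_two_steps, p_not_failed_four_steps)
```

Repeated multiplication is enough for a small example like this; for long horizons, repeated squaring or an eigen-decomposition would be the more economical choice.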



Continuous-time Markov chain 

Similar to the case of the DTMC, the discrete-state, continuous-time Markov process is usually called the continuous-time Markov chain. Let the time space $T = [0, \infty)$ be an index set and consider a continuous-time stochastic process $\{X(t), t \ge 0\}$ taking values on the discrete state space $\Omega$. We say that the process $\{X(t), t \ge 0\}$ is a Markov chain in continuous time if, for each $s \ge 0$, $t > 0$ and each set $A$, we have

$$\Pr\{X(t+s) \in A \mid X(u), 0 \le u \le s\} = \Pr\{X(t+s) \in A \mid X(s)\}$$

Specifically, if, for each $s \ge 0$, $t > 0$, each $i, j \in \Omega$, and every history $x(u)$, $0 \le u < s$,

$$\Pr\{X(t+s) = j \mid X(s) = i, X(u) = x(u), 0 \le u < s\} = \Pr\{X(t+s) = j \mid X(s) = i\} \tag{2.15}$$

then the process $\{X(t)\}$ is called a continuous-time Markov chain (CTMC).

A CTMC is a stochastic process having the Markov property that the conditional distribution of the future state, given the present state and all past states, depends only on the present state and is independent of the past. Also, define

$$p_{ij}(s,t) = \Pr\{X(t) = j \mid X(s) = i\}, \quad 0 \le s \le t \tag{2.16}$$

The conditional probability $p_{ij}(s,t)$ is called the transition probability function from state $i$ to state $j$, and the matrix $\mathbf{P}(s,t)$ is called the transition matrix function.

Similar to the DTMC, we have the Chapman-Kolmogorov equation

$$p_{ij}(s,t) = \sum_{k} p_{ik}(s,u)\,p_{kj}(u,t), \quad 0 \le s \le u \le t \tag{2.17}$$

In matrix notation, this can be written as

$$\mathbf{P}(s,t) = \mathbf{P}(s,u)\,\mathbf{P}(u,t), \quad 0 \le s \le u \le t \tag{2.18}$$

The above equation can be compared with its discrete-time counterpart (2.13) or (2.14).

When the transition probability functions $p_{ij}(s,t)$ depend only on the difference $\Delta t = t - s$, i.e.,

$$p_{ij}(\Delta t) = \Pr\{X(\Delta t + s) = j \mid X(s) = i\}, \quad 0 \le s \le t, \text{ for all } i, j \in \Omega,$$

the continuous-time Markov chain $\{X(t)\}$ is said to be homogeneous. For any homogeneous Markov chain, the Chapman-Kolmogorov equation is expressed as

$$p_{ij}(s+t) = \sum_{k} p_{ik}(s)\,p_{kj}(t), \quad s, t \ge 0 \tag{2.19}$$

This can be written in matrix form as

$$\mathbf{P}(s+t) = \mathbf{P}(s)\,\mathbf{P}(t), \quad s, t \ge 0 \tag{2.20}$$

where $\mathbf{P}(t) = (p_{ij}(t))$, which satisfies

$$\mathbf{P}(t-s) = \mathbf{P}(s,t) \quad \text{for } t \ge s \ge 0$$

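As given by Kijima (1997, p. 174), the derivative of $\mathbf{P}(t)$ can be written as

$$\mathbf{P}'(t) = [\mathbf{P}(t+h) - \mathbf{P}(t)] \left[ \int_0^h \mathbf{P}(u)\,du \right]^{-1} \tag{2.21}$$

which shows that $\mathbf{P}(t)$ is infinitely differentiable with respect to $t \ge 0$.

Define $\mathbf{Q} = \mathbf{P}'(0+)$. The matrix $\mathbf{Q} = (q_{ij})$ is called the infinitesimal generator, or generator for short. This is of fundamental importance in the theory of CTMC. Since $\mathbf{P}(0) = \mathbf{I}$, we have

$$q_{ij} = \begin{cases} \lim\limits_{h \to 0+} \dfrac{p_{ij}(h)}{h} \ge 0, & i \ne j \\[2ex] \lim\limits_{h \to 0+} \dfrac{p_{ii}(h) - 1}{h} \le 0, & i = j \end{cases} \tag{2.22}$$

Since $\mathbf{P}(t)$ is differentiable, it follows from (2.22) that

$$\mathbf{P}'(t) = \mathbf{Q}\,\mathbf{P}(t) = \mathbf{P}(t)\,\mathbf{Q}, \quad t \ge 0 \tag{2.23}$$

which are systems of ordinary linear differential equations. The former, $\mathbf{P}'(t) = \mathbf{Q}\,\mathbf{P}(t)$, is known as the backward Kolmogorov equation and the latter, $\mathbf{P}'(t) = \mathbf{P}(t)\,\mathbf{Q}$, as the forward Kolmogorov equation (Ross, 2000).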



Example 2.8. Suppose that a computing system has two states, Good and Failed, denoted by 1 and 2, respectively. Suppose that the transition time from state $i$ to state $j$ follows a continuous-time distribution, say the exponential distribution,

$$F_{ij}(t) = 1 - \exp(-\lambda_{ij} t), \quad i, j = 1, 2$$

The CTMC is depicted in Fig. 2.8.

Fig. 2.8. CTMC for the two-state system.

From the exponential distribution, we have

$$p_{ij}(h) = 1 - \exp(-\lambda_{ij} h), \quad i \ne j \tag{2.24}$$

Then, $q_{ij}$ can be written according to Eq. (2.22) for $i \ne j$ as

$$q_{ij} = \lim_{h \to 0+} \frac{p_{ij}(h)}{h} = \lim_{h \to 0+} \frac{1 - \exp(-\lambda_{ij} h)}{h} = \lim_{h \to 0+} \frac{\exp(-\lambda_{ij} t) - \exp\{-\lambda_{ij}(t+h)\}}{h \cdot \exp(-\lambda_{ij} t)} \tag{2.25}$$

Let $R_{ij}(t) = \exp(-\lambda_{ij} t)$. We have

$$q_{ij} = \lim_{h \to 0+} \frac{R_{ij}(t) - R_{ij}(t+h)}{h \cdot R_{ij}(t)} = -\frac{R_{ij}'(t)}{R_{ij}(t)} = \lambda_{ij} \tag{2.26}$$

This result is useful: it implies that for the exponential distribution, $q_{ij}$ is equal to its rate.

Then, the Chapman-Kolmogorov equations for Fig. 2.8 can be written as

$$P_1'(t) = \lambda_{21} P_2(t) - \lambda_{12} P_1(t) \tag{2.27}$$

and

$$P_2'(t) = \lambda_{12} P_1(t) - \lambda_{21} P_2(t) \tag{2.28}$$

With the initial condition (assuming that the system is initially in the good state)

$$P_1(0) = 1, \quad P_2(0) = 0 \tag{2.30}$$

we obtain the availability function as

$$A(t) = P_1(t) = \frac{\lambda_{12}}{\lambda_{21} + \lambda_{12}} \exp\{-(\lambda_{21} + \lambda_{12}) t\} + \frac{\lambda_{21}}{\lambda_{21} + \lambda_{12}} \tag{2.31}$$
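As a cross-check of the two-state solution, the differential equations for $P_1(t)$ and $P_2(t)$ can be integrated numerically and compared with the closed-form availability. The sketch below uses a classical fourth-order Runge-Kutta step in plain Python. The rate values `LAM12` and `LAM21` are hypothetical numbers chosen only for illustration, and the function names are not from the original text.

```python
import math


def availability_closed_form(t, lam12, lam21):
    """Closed-form A(t) = P1(t) for the two-state CTMC started in state 1."""
    total = lam12 + lam21
    return lam12 / total * math.exp(-total * t) + lam21 / total


def availability_rk4(t_end, lam12, lam21, steps=1000):
    """Integrate P1' = lam21*P2 - lam12*P1 and P2' = lam12*P1 - lam21*P2
    from P1(0) = 1, P2(0) = 0 with the classical Runge-Kutta scheme."""
    def deriv(p1, p2):
        return (lam21 * p2 - lam12 * p1, lam12 * p1 - lam21 * p2)

    h = t_end / steps
    p1, p2 = 1.0, 0.0
    for _ in range(steps):
        k1 = deriv(p1, p2)
        k2 = deriv(p1 + 0.5 * h * k1[0], p2 + 0.5 * h * k1[1])
        k3 = deriv(p1 + 0.5 * h * k2[0], p2 + 0.5 * h * k2[1])
        k4 = deriv(p1 + h * k3[0], p2 + h * k3[1])
        p1 += h / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0])
        p2 += h / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1])
    return p1


# Hypothetical rates (per unit time), for illustration only.
LAM12 = 0.05   # failure rate, state 1 -> state 2
LAM21 = 0.5    # repair rate, state 2 -> state 1

numeric = availability_rk4(10.0, LAM12, LAM21)
exact = availability_closed_form(10.0, LAM12, LAM21)
print(numeric, exact)
```

The same integration loop carries over to larger state spaces, where a closed-form solution is usually not available.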



Discrete Time, Continuous State 

The discrete-time continuous-state Markov model is applicable if there are 
discrete changes in time in an environment where the states of the system are 
continuous over a specified range. 

It is easy to see how the concept could be applied to the component parameter drift problem. However, little work has been done in this area, and multi-parameter modeling and computation remain a difficult problem. There are two possible reasons: numerical data are seldom available, and the solution of the resulting partial differential equations is more complex.



Continuous Time, Continuous State 

The conventional diffusion equations fall in this category of continuous-time and 
continuous-state Markov models. Usually when we talk about the system state 
space, we attempt to describe it in fixed terms. In reliability, we talk about fully 
operational systems or failed systems. Once we introduce the concept of 
degraded operability, it is easy to imagine a continuum of physical states in 
which the system can exist. There could be some other advanced applications. 
However, the evaluation of these equations will be costly and more involved. 

Since little work has been done in the area of continuous-state models (Types 3 and 4 in Table 2.1), the continuous-state Markov process will not be discussed in this book. For details, readers can refer to Kijima (1997).



2.3.3. Some non-standard Markovian models 

Some important aspects of system behavior cannot be easily captured in certain 
types of the above Markov models. The common characteristic these problems 
share is that the Markov property is not valid at all time instants. This category of 
problems is jointly referred to as non-Markovian models and can be analyzed 
using several approaches, see e.g., Limnios & Oprisan (2000). 



Markov renewal sequence 

We first introduce the renewal process. Let $S_0 \le S_1 \le S_2 \le \cdots$ be the time instants at which successive events occur. If the inter-event times $S_n - S_{n-1}$, $n = 1, 2, \ldots$, are non-negative, independent and identically distributed random variables, the sequence $\{S_n - S_{n-1}; n = 1, 2, \ldots\}$ is a renewal process.

This idea can be generalized by letting the times $S_n - S_{n-1}$ depend on a state. We can assume that there is a set of states $\Omega$, which can be thought of as the set $\{0, 1, \ldots\}$, as before. The state at $S_n$ is given by $X_n \in \Omega$. The chain $X_n$ now forms a process on its own; in particular, it may form a DTMC. The points $S_n$, $n = 0, 1, 2, \ldots$, are called Markov regeneration epochs, or Markov renewal moments. Together with the states of the embedded Markov chain $X_n$, they define a Markov renewal sequence.



Definition 2.7. The bivariate stochastic process $(X, S) = \{X_n, S_n; n = 0, 1, 2, \ldots\}$ is a Markov renewal sequence provided that

$$\Pr\{X_{n+1} = j, S_{n+1} - S_n \le t \mid X_0, \ldots, X_n; S_0, \ldots, S_n\} = \Pr\{X_{n+1} = j, S_{n+1} - S_n \le t \mid X_n\}, \quad n = 0, 1, 2, \ldots;\ j \in \Omega,\ t \ge 0 \tag{2.32}$$

The random variables $S_n$ are the regeneration epochs, and the $X_n$ are the states at these epochs.



Markov renewal sequences are embedded in Markov renewal models. Markov renewal models can be classified into two categories: the semi-Markov model and the Markov regenerative model.

Semi-Markov process 

A possible generalization of the CTMC is to allow the holding time to follow general distributions. That is, by letting $F_i(t)$ be the holding-time distribution when the process is in state $i$, we can construct a stochastic process $\{X(t)\}$ as follows. If $X(0) = i$, then the process stays in state $i$ for a time with distribution function $F_i(t)$. At the end of the holding time, the process moves to state $j$, which can be equal to $i$, according to the Markovian law $\mathbf{P} = (p_{ij})$. The process stays in state $j$ for a time with distribution function $F_j(t)$ and then moves to some state according to $\mathbf{P}$. Under some regularity conditions, we can construct a stochastic process by repeating the above procedure.

We can introduce more dependence structure into the holding times. Namely, when $X(0) = i$, we choose the next state $j$ and the holding time simultaneously according to a joint distribution $F_{ij}(t)$. Given the next state $j$, the holding-time distribution is given by $F_{ij}(t)/F_{ij}(\infty)$. After the holding time, a transition to state $j$ occurs. At the same time, the next state $k$ as well as the holding time is determined according to a joint distribution $F_{jk}(t)$. A stochastic process constructed in this way is called a semi-Markov process.



Definition 2.8. Let $\Omega$ denote the state space and let $\{Y_n\}$ be a sequence of random variables taking values on $\Omega$. Let $\{V_n\}$ be a sequence of random variables taking values on $R_+ = [0, \infty)$ and let

$$T_n = \sum_{k=0}^{n-1} V_k, \quad n = 1, 2, \ldots, \text{ with } T_0 \equiv 0$$

We define $\Gamma(t) = \max\{n : T_n \le t\}$, $t \ge 0$, the renewal process associated with $\{V_n\}$. Thereafter, with the above notation, suppose that

$$\Pr\{Y_{n+1} = j, V_n \le t \mid Y_0, \ldots, Y_n = i; V_0, \ldots, V_{n-1}\} = \Pr\{Y_{n+1} = j, V_n \le t \mid Y_n = i\} \tag{2.33}$$

for all $n = 0, 1, \ldots$; $i, j \in \Omega$, and $t \ge 0$. Then the stochastic process $\{X(t)\}$ defined by $X(t) = Y_{\Gamma(t)}$, $t \ge 0$, is called a semi-Markov process.



For a semi-Markov process, the time distribution $F_{ij}(t)$ satisfies the following equation:

$$F_{ij}(t) = \sum_{k} p_{ik}\, p_{kj} \cdot F_{ik}(t) \otimes F_{kj}(t) \tag{2.34}$$

where '$\otimes$' denotes the convolution of the two functions, defined as

$$F(t) \otimes G(t) = \int_0^t F(s)\, G(t-s)\, ds \tag{2.35}$$

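Using the Laplace-Stieltjes transform, which turns the convolution into a product, Eq. (2.34) can be simplified as

$$\tilde{F}_{ij}(s) = \sum_{k} p_{ik}\, p_{kj}\, \tilde{F}_{ik}(s)\, \tilde{F}_{kj}(s)$$

where $\tilde{F}_{ij}(s)$ is the Laplace-Stieltjes transform of $F_{ij}(t)$.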



Markov regenerative process 

The Markov regenerative model builds on the Markov regenerative process. A stochastic process $Z = \{Z_t; t \ge 0\}$ with the state space $\Omega$ is called regenerative if there exist time points at which the process probabilistically restarts itself. The formal definition of a Markov regenerative process is given now.



Definition 2.9. A Markov regenerative process is defined as a stochastic process $\{Z_t; t \ge 0\}$, $Z_t \in \Omega$, with an embedded Markov renewal sequence $(X, S)$, $X_n \in F$, which has the additional property that all conditional finite-dimensional distributions of $\{Z_{S_n + t}; t \ge 0\}$, given $\{Z_u; 0 \le u \le S_n, X_n = i\}$, are the same as those of $\{Z_t; t \ge 0\}$, given $X_0 = i$.

As a special case, the definition implies that for $i \in F$, $j \in \Omega$,

$$\Pr\{Z_{S_n + t} = j \mid Z_u, 0 \le u \le S_n, X_n = i\} = \Pr\{Z_t = j \mid X_0 = i\} \tag{2.36}$$

The expression in Eq. (2.36) implies that the Markov regenerative process does not have the Markov property in general, but there is a sequence of embedded time points $S_0, S_1, \ldots, S_n, \ldots$ such that the states $X_0, X_1, \ldots, X_n, \ldots$ realized at these points satisfy the Markov property. It also implies that the future of the process $Z$ from time $t = S_n$ onwards depends on the past $\{Z_u, 0 \le u \le S_n\}$ only through $X_n$.

Unlike in semi-Markov processes, state changes in a Markov regenerative process may occur between two consecutive Markov regeneration epochs. An example of a Markov regenerative process is illustrated below.

Example 2.9. Suppose that a system has two states: 0 and 1 (good and failed). When the system fails, it is restarted immediately. After restarting, the system may be in the good state (with probability $p$) or in the failed state again (with probability $1 - p$). When the system is in the good state, it may fail with a failure rate $\lambda$. Then the process is a Markov regenerative process, where the restarting points are regeneration epochs.

Given that the initial state is the first restart state, the Markov regenerative process is depicted in Fig. 2.9, in which state $S_i$ is the $i$-th restart point and $G_i$ is the good state between $S_i$ and $S_{i+1}$.




Fig. 2.9. Markov regenerative process of Example 2.9.



2.3.4. General procedure of Markov modeling

A Markov process is characterized by its state space together with the transition 
probabilities over time between these states. The basic steps in the modeling and 
analysis are described in the following. 



Setting up the model 

In the first step, a Markov state diagram can be developed by determining the 
system states and the transitions between these states. It also includes labeling the 
states such as operational, degraded, or failed. There could be several states in the 
degraded category. 

The state diagrams depict all possible internal relationships among states and 
define the allowable transitions from one state to another. In general, the state 
diagram is made up of nodes and links, where the nodes represent the different 
states and the links represent the transition between the connected two states. 

For a DTMC, the time between two states is discrete and is usually set to 1 unit. For a CTMC, on the other hand, the time between two states is continuous. The Markov chain can be constructed by drawing a state diagram that is made up of these units.

For the semi-Markov process, building the model is more complex because it involves two steps. First, the state diagram is drawn as a DTMC with transition probability matrix $\mathbf{P}$. Then, the process in continuous time is set up by letting the time spent in a transition from state $i$ to state $j$ have the Cdf $F_{ij}(t)$.



Chapman-Kolmogorov equations 

The second step converts the Markov state diagram developed in the preceding step into a set of equations. The well-known equations for Markov models are the Chapman-Kolmogorov equations; see e.g. Trivedi (1982).



Solving the equations

Solving the state equations is sometimes complicated. An analytical solution of 
the state equations is feasible only for simple problems. Fortunately, a number of 
solution techniques exist, such as analytical solution, Laplace-Stieltjes 
transforms, numerical integration and computer-assisted evaluation, which can 
simplify this task, see e.g. Pukite & Pukite (1998, pp. 119-136). 

The use of Laplace-Stieltjes transforms in engineering is well known, see 
Gnedenko & Ushakov (1995) for details. Important applications are in control 
system stability evaluation, circuit analysis, and so on. Laplace-Stieltjes 
transforms provide a convenient way of solving simpler models. Solution of the 
Markov state equations using this approach involves two steps: 

a) State equations are transformed to their Laplace counterparts. 

b) The resulting equations are inverted to obtain their time-domain solutions. 

If the mission times are short and if the transition rates are small, then 
approximations can be used that may meet the accuracy requirements. An 
example is as follows. 



Example 2.10. Consider a state diagram that can be expressed as a sequence of transitions, as shown in Fig. 2.10.

Fig. 2.10. N-state Markov diagram.

The state probability for the last state can be given in Laplace-Stieltjes transform by

$$P_N(s) = \frac{\lambda_1 \lambda_2 \cdots \lambda_N}{(s + \lambda_1)(s + \lambda_2) \cdots (s + \lambda_N)\, s}$$

By expanding the denominator, substituting this expression in the equation for $P_N(s)$ and then performing the long division, we get

$$P_N(s) = \frac{1}{s^{N+1}} \prod_{i=1}^{N} \lambda_i - \frac{1}{s^{N+2}} \prod_{i=1}^{N} \lambda_i \sum_{i=1}^{N} \lambda_i + \cdots \tag{2.37}$$

This equation can be easily inverted term by term using the inverse Laplace-Stieltjes transform, and we have

$$P_N(t) = \frac{t^{N}}{N!} \prod_{i=1}^{N} \lambda_i - \frac{t^{N+1}}{(N+1)!} \prod_{i=1}^{N} \lambda_i \sum_{i=1}^{N} \lambda_i + \cdots$$
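To see how good such a short-mission approximation can be, the two leading terms of the series (obtained by inverting $1/s^{N+1}$ and $1/s^{N+2}$ term by term) can be compared with the exact probability of having completed all $N$ transitions. The sketch below is an illustration under stated assumptions: the exact value is computed from the standard formula for a sum of independent exponential times with pairwise distinct rates, and the rate values in `RATES` are hypothetical.

```python
import math


def exact_last_state_prob(rates, t):
    """Exact probability that all transitions have occurred by time t.

    Uses the distribution of a sum of independent exponential times;
    the formula requires pairwise distinct rates.
    """
    total = 0.0
    for i, lam_i in enumerate(rates):
        coeff = 1.0
        for j, lam_j in enumerate(rates):
            if j != i:
                coeff *= lam_j / (lam_j - lam_i)
        total += coeff * math.exp(-lam_i * t)
    return 1.0 - total


def two_term_approx(rates, t):
    """First two terms of the series: t^N/N! * prod - t^(N+1)/(N+1)! * prod * sum."""
    n = len(rates)
    prod = math.prod(rates)
    first = t ** n / math.factorial(n) * prod
    second = t ** (n + 1) / math.factorial(n + 1) * prod * sum(rates)
    return first - second


# Hypothetical small transition rates and a short mission time.
RATES = [0.01, 0.02, 0.03]
T_SHORT = 1.0

exact = exact_last_state_prob(RATES, T_SHORT)
approx = two_term_approx(RATES, T_SHORT)
print(exact, approx, abs(approx - exact) / exact)
```

The relative error grows with the mission time and with the size of the rates, which is why the approximation is recommended only when both are small.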



2.4. Nonhomogeneous Poisson Process (NHPP) Models 

A counting process, $N(t)$, is obtained by counting the number of certain events occurring in the time interval $[0, t)$. The simplest model is the Poisson process model, which assumes that the times between failures are exponentially distributed, that the process has independent increments, and that the failure occurrence rate is constant over time. Such a model is also a Markov model of the kind discussed before. Here we will focus on the case of a time-dependent failure occurrence rate, i.e., general NHPP models. Such models are widely used to model the number of failures of a system over time, especially in software reliability analysis (Xie, 1991).



2.4.1. General formulation 

Nonhomogeneous Poisson Process (NHPP) models are very useful in reliability analysis, especially for repairable systems. Since hardware systems are usually repairable, and software debugging is a repair process, NHPP models can be used for both software and hardware, and for combined systems.

For a counting process $\{N(t), t \ge 0\}$ modeled by an NHPP, $N(t)$ follows a Poisson distribution given the following underlying assumptions of the NHPP:

1) $N(0) = 0$,

2) $\{N(t), t \ge 0\}$ has independent increments,

3) $\Pr\{N(t+h) - N(t) = 1\} = \lambda(t)h + o(h)$,

4) $\Pr\{N(t+h) - N(t) \ge 2\} = o(h)$.

In the above, $o(h)$ denotes a quantity for which $o(h)/h$ tends to zero as $h \to 0$. The intensity function $\lambda(t)$ is defined as

$$\lambda(t) = \lim_{\Delta t \to 0+} \frac{\Pr\{N(t + \Delta t) - N(t) = 1\}}{\Delta t}$$

If we let

$$m(t) = \int_0^t \lambda(u)\, du$$

then it can be shown, see e.g. Ross (2000, pp. 284-286), that

$$\Pr\{N(t + t_0) - N(t_0) = n\} = \exp\{-[m(t + t_0) - m(t_0)]\} \cdot \frac{[m(t + t_0) - m(t_0)]^n}{n!}, \quad n \ge 0$$

That is, $N(t + t_0) - N(t_0)$ is a Poisson random variable with mean $m(t + t_0) - m(t_0)$. This implies that $N(t)$ is Poisson given $N(0) = 0$ at the initial time $t_0 = 0$, i.e.,

$$\Pr\{N(t) = n\} = \frac{[m(t)]^n}{n!} \exp\{-m(t)\}, \quad n = 0, 1, 2, \ldots \tag{2.38}$$



Here $m(t)$ is called the mean value function of the NHPP. If $N(t)$ represents the number of system failures, the function $m(t)$ describes the expected cumulative number of failures in $[0, t)$. Hence, $m(t)$ is a very useful descriptive measure of the failure behavior.



2.4.2. Reliability measures and properties 

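Given the mean value function $m(t)$, the failure intensity function $\lambda(t)$ can be computed by

$$\lambda(t) = \frac{dm(t)}{dt} \tag{2.39}$$

Moreover, the reliability function at time $t_0$, i.e., the probability of no failure during a further time $t$ after $t_0$, is given by

$$R(t \mid t_0) = \exp\{m(t_0) - m(t + t_0)\} \tag{2.40}$$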

Generally, by using different functions m(t), different NHPP models can be 
obtained. In the simplest case for which A(t) is constant, the NHPP becomes a 
homogeneous Poisson process which has a mean value function as t multiplied by 
a constant. 

Like the Poisson distribution to which it is related, the NHPP is characterized by several unique and desirable mathematical properties. For example, NHPPs are closed under superposition, that is, the sum of a number of independent NHPPs is also an NHPP. Generally, we may mix the failure time data from different failure processes assumed to be NHPPs and obtain an overall NHPP with a mean value function which is the sum of the mean value functions of the underlying NHPP models.

Any NHPP can be transformed to a homogeneous Poisson process through an appropriate time transformation. From the general theory of NHPP, it is well known that if $\{N(t), t \ge 0\}$ is an NHPP with mean value function $m(t)$, then the time-transformed process $N^*(t)$ defined as

$$N^*(t) = N(y(t)), \quad t \ge 0 \tag{2.41}$$

is also an NHPP. The mean value function of the NHPP $\{N^*(t), t \ge 0\}$ is

$$m^*(t) = m(y(t)), \quad t \ge 0 \tag{2.42}$$

In particular, if $y(t) = m^{-1}(t)$, the time-transformed process becomes a homogeneous Poisson process with rate one, i.e., its mean value function is equal to $t$.



Example 2.11. Suppose the mean value function of an NHPP model is

$$m(t) = a \cdot [1 - \exp(-b \cdot t)], \quad a > 0,\ b > 0$$

Let

$$y(t) = m^{-1}(t) = -\frac{1}{b} \ln\left(1 - \frac{t}{a}\right)$$

Then the time-transformed process $N^*(t) = N(y(t))$ is also an NHPP with the mean value function

$$m^*(t) = m(y(t)) = a \cdot [1 - \exp\{-b \cdot y(t)\}] = t$$

Therefore, the failure intensity function is derived by

$$\lambda^*(t) = \frac{dm^*(t)}{dt} = 1$$

where $\lambda^*(t)$ is a constant, which indicates that this time-transformed process becomes a homogeneous Poisson process with constant rate 1.
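The quantities of this section are easy to evaluate for the exponential mean value function of Example 2.11. The following plain-Python sketch computes the mean value function, the intensity, the Poisson probabilities, the conditional reliability and the inverse time transformation. The parameter values `A` and `B` are hypothetical, and the function names are chosen here for illustration only.

```python
import math


def mean_value(t, a, b):
    """m(t) = a * (1 - exp(-b t)), the mean value function of Example 2.11."""
    return a * (1.0 - math.exp(-b * t))


def intensity(t, a, b):
    """lambda(t) = dm(t)/dt for the mean value function above."""
    return a * b * math.exp(-b * t)


def prob_n_failures(n, t, a, b):
    """Pr{N(t) = n}: Poisson probability with mean m(t) (moderate n only)."""
    m = mean_value(t, a, b)
    return m ** n / math.factorial(n) * math.exp(-m)


def reliability(t, t0, a, b):
    """R(t | t0) = exp{m(t0) - m(t + t0)}: no failure for a time t after t0."""
    return math.exp(mean_value(t0, a, b) - mean_value(t0 + t, a, b))


def inverse_mean_value(t, a, b):
    """y(t) = m^{-1}(t) = -(1/b) ln(1 - t/a), defined for 0 <= t < a."""
    return -math.log(1.0 - t / a) / b


# Hypothetical parameter values, for illustration only.
A = 100.0   # expected total number of failures as t grows
B = 0.05    # decay rate of the failure intensity

print(mean_value(10.0, A, B), intensity(10.0, A, B))
print(prob_n_failures(40, 10.0, A, B), reliability(1.0, 10.0, A, B))
# Time transformation: m(y(t)) should reproduce t, i.e. rate one.
print(mean_value(inverse_mean_value(25.0, A, B), A, B))
```

For large counts, working with logarithms (for example through `math.lgamma`) would avoid overflow in the power and factorial terms.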



2.4.3. Parameter estimation 

Usually, the mean value function $m(t)$ contains some unknown parameters. They are generally estimated by the method of maximum likelihood or the method of least squares.

Denote by $n_i$ the number of faults detected in the time interval $[t_{i-1}, t_i)$, where $0 = t_0 < t_1 < \cdots < t_k$ and $t_i$ is the running time since the beginning. The likelihood function for the NHPP model with mean value function $m(t)$ is

$$L(n_1, \ldots, n_k) = \prod_{i=1}^{k} \frac{[m(t_i) - m(t_{i-1})]^{n_i} \exp\{-[m(t_i) - m(t_{i-1})]\}}{n_i!}$$

The parameters in $m(t)$ can then be estimated by maximizing this likelihood function. Usually, numerical procedures have to be used in solving the likelihood equations.
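One possible numerical procedure is sketched below for the exponential mean value function $m(t) = a[1 - \exp(-bt)]$ of Example 2.11. For a fixed $b$, setting the derivative of the log-likelihood with respect to $a$ to zero gives $a = \sum n_i / [1 - \exp(-b t_k)]$, so only a one-dimensional search over $b$ remains. The sketch uses a golden-section search and assumes that the profile likelihood has a single maximum inside the search bracket. The data, the bracket and all function names are hypothetical.

```python
import math


def profile_log_likelihood(b, times, counts):
    """Grouped-data NHPP log-likelihood (constant -log n_i! terms dropped)
    for m(t) = a(1 - exp(-b t)), with a set to its best value for this b.
    times = [t_0 = 0, t_1, ..., t_k]; counts = [n_1, ..., n_k]."""
    total = sum(counts)
    a_hat = total / (1.0 - math.exp(-b * times[-1]))
    log_lik = -total  # equals -m(t_k) when a = a_hat
    for i, n_i in enumerate(counts):
        dm = a_hat * (math.exp(-b * times[i]) - math.exp(-b * times[i + 1]))
        if dm <= 0.0:  # numerical underflow for very large b
            return float("-inf"), a_hat
        log_lik += n_i * math.log(dm)
    return log_lik, a_hat


def fit_exponential_nhpp(times, counts, b_low=1e-6, b_high=5.0, tol=1e-10):
    """Maximize the profile likelihood over b by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = b_low, b_high
    while hi - lo > tol:
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        f1 = profile_log_likelihood(x1, times, counts)[0]
        f2 = profile_log_likelihood(x2, times, counts)[0]
        if f1 < f2:
            lo = x1   # the maximum lies to the right of x1
        else:
            hi = x2   # the maximum lies to the left of x2
    b_hat = 0.5 * (lo + hi)
    return profile_log_likelihood(b_hat, times, counts)[1], b_hat


# Hypothetical grouped failure data: interval boundaries and fault counts.
times = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
counts = [39, 24, 14, 9, 5]

a_hat, b_hat = fit_exponential_nhpp(times, counts)
print(a_hat, b_hat)
```

For other mean value functions, the same likelihood can be handed to a general-purpose optimizer; the profile trick above is specific to models in which one parameter enters $m(t)$ as a pure scale factor.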



MODELS FOR HARDWARE
SYSTEM RELIABILITY



In the computing systems, hardware (such as hard disk, router, processor, CPU, 
memory, etc.) provides the fundamental configurations to support computing 
tasks. This chapter focuses on the methods and models that are commonly used in 
analyzing the hardware reliability. They are also useful for integrated systems 
which will be discussed in later chapters. 

Reliability models for a single-component system are presented first. Then, some models of parallel configurations are studied. Following that, some other techniques for fault-tolerant systems, including load-sharing and standby configurations, are also shown.



3.1. Single Component System

We first consider a system with one component, or a system that is treated as a black box. A single hardware component may have a normal functioning state, a few degraded states and a failed state. This section analyzes the reliability performance of the single component, considering a single failure mode, double failure modes and multiple failure modes.

3.1.1. Case of a single failure mode

Suppose that there are two states, and a single, irreversible transition between the 
two states as shown in Fig. 3.1. The two states are operational state and failed 
state denoted by state 1 and 2 respectively. Such a case is called single failure 
mode case here. 




Fig. 3.1. State transition diagram for a single component with 
a single failure mode. 



In Fig. 3.1, $\lambda$ is the transition rate from state 1 to state 2, and it corresponds to the failure rate of the hardware component, whose lifetime is assumed to follow the exponential distribution. The component reliability (the probability of being in state 1) is given by

$$R(t) = P_1(t) = \exp(-\lambda t) \tag{3.1}$$

If the component is repairable with the repair rate $\mu$, the Markov model is shown in Fig. 3.2.




Fig. 3.2. State transition diagram for a repairable component. 



The reliability function, i.e., the probability that the component has not yet reached the failed state for the first time, is still $\exp(-\lambda t)$. However, the system availability function is the probability that the component is in the operational state (state 1) at the time instant $t$, and it is given by

$$A(t) = P_1(t) \tag{3.2}$$

To obtain $P_1(t)$, the Chapman-Kolmogorov equations can be written as

$$P_1'(t) = \mu P_2(t) - \lambda P_1(t) \tag{3.3}$$

$$P_2'(t) = \lambda P_1(t) - \mu P_2(t) \tag{3.4}$$

Since the system has to be in state 1 or state 2, $P_2(t) = 1 - P_1(t)$. Substituting this into the above equations, we get

$$P_1'(t) = -(\mu + \lambda) P_1(t) + \mu \tag{3.5}$$

With the initial conditions

$$P_1(0) = 1, \quad P_2(0) = 0 \tag{3.6}$$

we can obtain the availability function as

$$A(t) = P_1(t) = \frac{\lambda}{\mu + \lambda} \exp\{-(\mu + \lambda) t\} + \frac{\mu}{\mu + \lambda} \tag{3.7}$$

Example 3.1. Suppose that a hardware system has been working for 1000 hours, during which the system failed 30 times, and the total repair time for all the failures is 150 hours. If the hardware failure time and repair time follow exponential distributions, then the failure rate and repair rate (per hour) can be estimated by

$$\lambda = \frac{30}{1000} = 0.03 \quad \text{and} \quad \mu = \frac{30}{150} = 0.2 \tag{3.8}$$

The reliability function is

$$R(t) = \exp(-0.03t)$$

and the availability function is

$$A(t) = P_1(t) = 0.1304 \exp(-0.23t) + 0.8696$$

The curve of the availability function $A(t)$ is depicted in Fig. 3.3.




Fig. 3.3. The availability function for Example 3.1. 
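The numbers of Example 3.1 can be reproduced with a few lines of plain Python. This is only a sketch: the function names are illustrative, and the rates are estimated exactly as in the example (failures per observed hour, repairs per hour of repair time).

```python
import math


def estimate_rates(n_failures, observed_hours, repair_hours):
    """Failure and repair rate estimates as used in Example 3.1."""
    return n_failures / observed_hours, n_failures / repair_hours


def reliability(t, lam):
    """R(t) = exp(-lambda t) for the single failure mode."""
    return math.exp(-lam * t)


def availability(t, lam, mu):
    """A(t) = lam/(mu+lam) * exp{-(mu+lam) t} + mu/(mu+lam)."""
    return lam / (mu + lam) * math.exp(-(mu + lam) * t) + mu / (mu + lam)


lam, mu = estimate_rates(30, 1000.0, 150.0)
transient_coeff = lam / (mu + lam)   # coefficient of the decaying term
steady_state = mu / (mu + lam)       # limit of A(t) for large t

# Tabulate a few points of the availability curve.
for t in (0.0, 5.0, 10.0, 20.0, 50.0):
    print(t, round(reliability(t, lam), 4), round(availability(t, lam, mu), 4))
```

The table makes the qualitative behavior explicit: the reliability keeps decaying towards zero, while the availability settles at the steady-state value $\mu/(\mu + \lambda)$.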



3.1.2. Case of double failure modes 

The reliability evaluation above is based on faults which are permanent in nature. 
By considering the double failure modes including both intermittent failures and 
permanent failures (Prasad, 1991), a reliability model is presented here. The 
hardware component is assumed to start from an operational state and can go to 
either an intermittent failure state or a permanent failure state. The intermittent 
failure can also lead the component into the permanent failure state. This 
scenario is presented by a Markov model as shown in Fig. 3.4. 

In Fig. 3.4, the states 0, 1 and 2 are the operational, intermittent failure and permanent failure states, respectively. According to the model, state 0 can make a transition to state 1 with a rate $\nu$ and to state 2 with a rate $\lambda$. From the intermittent failure state 1, the component can either go to operational state 0 with a rate $\mu$ or to permanent failure state 2 with a rate $\lambda$.




Fig. 3.4. Markov model considering intermittent faults. 



Let $P_0(t)$, $P_1(t)$ and $P_2(t)$ be the probabilities of being in the states 0, 1 and 2, respectively. From Fig. 3.4, a set of Chapman-Kolmogorov equations can be written in the matrix form as:

$$[P_0'(t), P_1'(t), P_2'(t)] = [P_0(t), P_1(t), P_2(t)]\, \mathbf{T} \tag{3.9}$$

where the transition matrix $\mathbf{T}$ is given by

$$\mathbf{T} = \begin{bmatrix} -\nu - \lambda & \nu & \lambda \\ \mu & -\mu - \lambda & \lambda \\ 0 & 0 & 0 \end{bmatrix} \tag{3.10}$$

Taking the Laplace-Stieltjes transform, we get

$$[P_0(0), P_1(0), P_2(0)] = [P_0(s), P_1(s), P_2(s)] \cdot [s\mathbf{I} - \mathbf{T}] \tag{3.11}$$

where

$$[s\mathbf{I} - \mathbf{T}] = \begin{bmatrix} s + \nu + \lambda & -\nu & -\lambda \\ -\mu & s + \mu + \lambda & -\lambda \\ 0 & 0 & s \end{bmatrix} \tag{3.12}$$

and $P_i(0)$ is the initial value of $P_i(t)$ at $t = 0$, $i = 0, 1, 2$. Hence, we have

$$[P_0(s), P_1(s), P_2(s)] = [P_0(0), P_1(0), P_2(0)] \cdot [s\mathbf{I} - \mathbf{T}]^{-1} \tag{3.13}$$

Assuming that the system starts from the operational state, the boundary condition is

$$[P_0(0), P_1(0), P_2(0)] = [1, 0, 0]$$

Hence,

$$[P_0(s), P_1(s), P_2(s)] = [1, 0, 0] \cdot [s\mathbf{I} - \mathbf{T}]^{-1} = [1, 0, 0] \cdot \mathbf{A}^{-1} \tag{3.14}$$

where $\mathbf{A} = s\mathbf{I} - \mathbf{T}$, and the inverse of the matrix is given by:

$$\mathbf{A}^{-1} = \frac{1}{\det \mathbf{A}} \begin{bmatrix} s^2 + s(\mu + \lambda) & s\nu & s\lambda + \lambda(\nu + \mu + \lambda) \\ s\mu & s^2 + s(\nu + \lambda) & s\lambda + \lambda(\nu + \mu + \lambda) \\ 0 & 0 & s^2 + s(\nu + \mu + 2\lambda) + \lambda(\nu + \mu + \lambda) \end{bmatrix} \tag{3.15}$$

where

$$\det \mathbf{A} = s^3 + s^2(\nu + \mu + 2\lambda) + s(\lambda\nu + \lambda\mu + \lambda^2) \tag{3.16}$$

Solving for $P_0(s)$, we obtain (Prasad, 1991)

$$P_0(s) = \frac{s + \lambda + \mu}{s^2 + s(2\lambda + \nu + \mu) + \lambda\nu + \lambda\mu + \lambda^2} \tag{3.17}$$

Taking the inverse Laplace-Stieltjes transform, the system availability function can be obtained as

$$A(t) = P_0(t) = \frac{\mu}{\nu + \mu} \exp(-\lambda t) + \frac{\nu}{\nu + \mu} \exp\{-(\lambda + \nu + \mu) t\} \tag{3.18}$$



Example 3.2. Suppose that for a computing system, the rate at which intermittent failures occur is $\nu = 0.02$ and the rate of permanent failures is $\lambda = 0.01$. The repair rate from the intermittent failure state to the operational state is $\mu = 0.08$. Substituting them into the above availability function, we get

$$A(t) = P_0(t) = \frac{\mu}{\nu + \mu} \exp(-\lambda t) + \frac{\nu}{\nu + \mu} \exp\{-(\lambda + \nu + \mu) t\} = 0.8 \exp(-0.01t) + 0.2 \exp(-0.11t)$$
3.1.3. Case of multiple failure modes

This model applies if the given component can fail in several modes that have different effects on system operation; see, e.g., Levitin et al. (1998). The Markov diagram for a component with three failure modes, such as a component that can fail in either open or shorted mode or may drift outside the specified range, has the following states:

State 1 : Component is fully operational. 

State 2: Component has failed in open mode. 

State 3: Component has failed in shorted mode. 

State 4: Component has drifted outside specification values. 

Note that in this case states 2 and 3 will be failed states and state 4 a degraded 
state. The Markov transition diagram for this case is shown in Fig. 3.5. 




Fig. 3.5. Single component with three failure modes (transitions from the operational state: open failure, short failure, drift failure).



In effect, the total failure rate for the component is given by

$$\lambda_T = \lambda_O + \lambda_D + \lambda_S$$

Generally, if the hardware component has n types of failure modes and transitions among different failure modes are allowed, the Markov transition diagram is as depicted in Fig. 3.6.




Fig. 3.6. Markov diagram for n-type failure modes. 



In Fig. 3.6, state 0 is the fully operational state, and states 1 to n represent the n different failure modes. Denote the transition probability from state i to state j by

$$P_{ij} = P\{Z_{k+1} = j \mid Z_k = i\}, \quad i, j = 0, 1, 2, \ldots, n$$

In fact, the single-failure-mode and double-failure-mode models are two special cases of the n-type failure mode model with n = 1 and n = 2, respectively.



3.2. Parallel Configurations 

The parallel system is one of the most frequently used redundancy configurations for achieving fault tolerance, which is important in computing systems. A parallel configuration assumes that the failure of a component will not affect the operation of the remaining components and that all the components can support the functions of one another.



3.2.1. Two-component parallel configuration 

As the simplest parallel system, the two-component configuration is studied here. Its reliability block diagram is shown in Fig. 3.7.

Fig. 3.7. Two-component parallel configuration reliability block diagram.



For this two-component parallel configuration, if both components are identical, 
there are three states: 

State 1: Two components are operational. 

State 2: Only one component is operational. 

State 3: System has failed (all components have failed). 

The applicable Markov transition diagram for the parallel two-component 
redundant system is depicted by Fig. 3.8. 

Fig. 3.8. Markov model for two parallel components. 



The solution for the system reliability can be shown to be

$$R(t) = 1 - P_3(t) = 1 - [1-\exp(-\lambda t)]^2$$

3.2.2. Majority voting configuration

Majority voting systems form an important class of redundant systems. In a majority voting system, all of the components are assumed to be in operation. Many voting systems for N-component hardware are based on the majority rule; see, e.g., Ashrafi et al. (1994).

The simplest majority voting system consists of three components and a voter. 
The reliability block diagram for a majority voter configuration is shown in Fig. 
3.9. 




Fig. 3.9. Triple majority voter reliability block diagram. 

This configuration is also known as a triple modular redundancy configuration, 
and it requires at least two good components for operation. Assuming that the 
voter is perfect, the system states are: 

State 1: Three components are operational. 

State 2: Two components are operational. 

State 3: System has failed. 

The Markov transition diagram is shown in Fig. 3.10. The solution for evaluating the reliability of the triple modular redundancy configuration is

$$R(t) = \exp(-3\lambda t) + 3\exp(-2\lambda t)[1-\exp(-\lambda t)]$$

Fig. 3.10. State transition diagram for three-component majority voting (Triplex-duplex).



A modified scheme 

It is also possible to increase the reliability of the triple modular redundancy 
system by making a simple modification in the operating sequence. After the first 
failure has been detected, there are two remaining modules or components. There 
is usually no need to keep both of the remaining components, since it will not be 
possible to identify the failed component after the second failure. The resulting 
state transition diagram will become as shown in Fig. 3.11. 



Fig. 3.11. State transition diagram for three-component 
majority voting (Triplex-simplex). 



The reliability function can be derived as

$$R(t) = \exp(-3\lambda t) + 3\exp(-\lambda t)[1-\exp(-\lambda t)]$$



Example 3.3. A three-component majority voting system has the failure rate $\lambda = 0.01$ for each parallel component.

Without removing any component when the first component fails, the reliability function is

$$R_1(t) = \exp(-0.03t) + 3\exp(-0.02t)[1-\exp(-0.01t)]$$

The probability that the system successfully completes a 10-hour mission can be computed as $R_1(10) = 0.9746$.

If we remove one of the two remaining components when the first component fails, the reliability function is

$$R_2(t) = \exp(-0.03t) + 3\exp(-0.01t)[1-\exp(-0.01t)]$$

The probability that the system works well for 10 hours is $R_2(10) = 0.9991$. The curves of the two reliability functions $R_1(t)$ and $R_2(t)$ are depicted in Fig. 3.12.

Fig. 3.12. Reliability function curves for Example 3.3.
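The two mission probabilities quoted in Example 3.3 can be reproduced by evaluating the two expressions exactly as they are printed. The short sketch below does this; the function names are illustrative and not from the original text.

```python
import math

LAM = 0.01  # failure rate of each component, from Example 3.3


def r_triplex_duplex(t, lam=LAM):
    """R_1(t): all remaining components stay connected after the first failure."""
    return (math.exp(-3 * lam * t)
            + 3 * math.exp(-2 * lam * t) * (1 - math.exp(-lam * t)))


def r_triplex_simplex(t, lam=LAM):
    """R_2(t) as printed in Example 3.3: one component is removed after the first failure."""
    return (math.exp(-3 * lam * t)
            + 3 * math.exp(-lam * t) * (1 - math.exp(-lam * t)))


for t in (10, 50, 100):
    print(t, round(r_triplex_duplex(t), 4), round(r_triplex_simplex(t), 4))
```

Tabulating both functions over a range of mission times gives the data needed to redraw the comparison of the two schemes.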



3.2.3. k-out-of-N voting configuration

This redundancy configuration is known as N-Modular Redundancy. The configuration requires that k functional components out of a total of N are needed for the system to remain operational. Akhtar (1994) presented a Markov model to analyze the k-out-of-N voting system for both perfect and imperfect fault-coverage problems. The failures in the system may be covered or uncovered. Fault coverage is a measure of the ability to perform fault detection, fault location, fault containment, or fault recovery.



Perfect fault-coverage modeling 

Fig. 3.13 shows the Markov chain for the N-component system with perfect fault-coverage. The process is a birth-death process with a constant failure rate, denoted by $\lambda$, for each component. Here $\mu_i$ is the repair rate for state i, $0 < i \le N$.




Fig. 3.13. State transition diagram for the perfect-fault 
coverage Markov model. 



State i means that i components have failed and the rest are operational. The probability of being in the i:th state is denoted by $P_i(t)$, which can be obtained by solving the following Chapman-Kolmogorov equations:

$$P_0'(t) = \mu_1 P_1(t) - N\lambda P_0(t)$$

$$P_i'(t) = (N-i+1)\lambda P_{i-1}(t) + \mu_{i+1}P_{i+1}(t) - [(N-i)\lambda + \mu_i]P_i(t), \quad i = 1, 2, \ldots, N-1$$

$$P_N'(t) = \lambda P_{N-1}(t) - \mu_N P_N(t)$$

with the initial conditions

$$P_0(0) = 1, \quad P_i(0) = 0, \quad i = 1, 2, \ldots, N$$

The system availability can be computed through

$$A(t) = \sum_{i=0}^{N-k} P_i(t) \tag{3.19}$$

and the system reliability can be obtained by considering a pure birth process with $\mu_i = 0$ through the equation:

$$R(t) = \sum_{i=0}^{N-k} P_i(t) \tag{3.20}$$

Since there is no absorbing state in Fig. 3.13, the steady-state availability can be calculated by



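$$A = \sum_{i=0}^{N-k}\pi_i, \qquad \pi_i = \pi_0\prod_{j=0}^{i-1}\frac{(N-j)\lambda}{\mu_{j+1}}, \qquad \sum_{i=0}^{N}\pi_i = 1 \tag{3.21}$$

where $\pi_i$ denotes the steady-state probability of state i.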



Imperfect fault-coverage modeling 

Assume that each fault is recoverable with probability c. Fig. 3.14 shows the Markov chain for the imperfect fault-coverage model; see, e.g., Akhtar (1994). There is a transition to an absorbing state (where no repair is possible) with probability (1-c). The absorbing state is represented by state "N+1". Thus, there are N+2 states, denoted by $\Omega_c = \{0, 1, \ldots, N, N+1\}$.

Fig. 3.14. State transition diagram for the imperfect fault coverage Markov model.

As shown in Fig. 3.14, the model is obtained by considering 3 classes of states:



State 0: all units are operational.

State i ($1 \le i \le N$): i of the components have failed, with repair possible in all of these states, and the system transits from state i to i-1 with rate $\mu_i$ ($\mu_i = \mu$ for a single-repair facility; $\mu_i = i\mu$ for a multiple-repair facility).

State N+1: a system failure state, where repair is not possible.

The Chapman-Kolmogorov equations can be given by

$$P_i'(t) = (N-i+1)\lambda c\,P_{i-1}(t) - [(N-i)\lambda + \mu_i]P_i(t) + \mu_{i+1}P_{i+1}(t), \quad 1 \le i \le N-1 \tag{3.22}$$

$$P_0'(t) = -N\lambda\,P_0(t) + \mu_1 P_1(t) \tag{3.23}$$

$$P_N'(t) = \lambda c\,P_{N-1}(t) - \mu_N P_N(t) \tag{3.24}$$

and

$$P_{N+1}'(t) = \sum_{i=0}^{N}(N-i)\lambda(1-c)\,P_i(t) \tag{3.25}$$

Denote $X = sI - Q$, where Q is the transition-rate matrix obtained by excluding the last row and last column of the whole transition-rate matrix. With the initial condition $[P_0(0), P_1(0), \ldots, P_N(0)] = [1, 0, 0, \ldots, 0]$, Akhtar (1994) showed that

$$P_i(t) = \sum_{j=1}^{N+1}\frac{D_i(s)\big|_{s=-r_j}\exp(-r_j t)}{\prod_{z=1,\,z\ne j}^{N+1}(r_z - r_j)} \tag{3.26}$$

and

$$P_{N+1}(t) = 1 - \sum_{i=0}^{N}P_i(t) \tag{3.27}$$

where $s = -r_i$ ($0 < i \le N+1$) are the roots of $|X| = 0$, and $D_i(s)$ is the determinant of the matrix that replaces the i:th column of X by the initial vector $[P_0(0), P_1(0), \ldots, P_N(0)]^T$.

The system availability can be computed with

$$A(t) = \sum_{i=0}^{N-k}P_i(t) \tag{3.28}$$

The system reliability R(t) can be derived by considering a pure birth process as

$$R(t) = \sum_{i=0}^{N-k}P_i(t), \quad \mu_i = 0 \tag{3.29}$$

The perfect fault-coverage model is a special case of the imperfect model obtained by fixing c = 1. The above availability and reliability results apply to the perfect case after substituting c = 1 in those equations.
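When the closed-form route through the roots of the determinant is inconvenient, the state equations of the imperfect fault-coverage model can also be integrated numerically. The sketch below does this with a hand-written Runge-Kutta step; the function names, the step count and the parameter values in the usage line are assumptions chosen for illustration and are not taken from any example in the text.

```python
def rk4(deriv, p, t_end, steps):
    """Classical fourth-order Runge-Kutta integration of p'(t) = deriv(p)."""
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = deriv([x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * g + d)
             for x, a, b, g, d in zip(p, k1, k2, k3, k4)]
    return p


def state_probabilities(n, lam, mu, c, t_end, steps=4000, multiple_repair=True):
    """Return [P_0, ..., P_N, P_{N+1}] at time t_end for the imperfect-coverage chain."""
    def repair_rate(i):
        if i == 0:
            return 0.0
        return i * mu if multiple_repair else mu

    def deriv(p):
        d = [0.0] * (n + 2)
        for i in range(n + 1):
            fail = (n - i) * lam                    # total failure rate out of state i
            d[i] -= (fail + repair_rate(i)) * p[i]
            if i < n:
                d[i + 1] += fail * c * p[i]         # covered fault: i -> i+1
            d[n + 1] += fail * (1.0 - c) * p[i]     # uncovered fault: i -> N+1
            if i >= 1:
                d[i - 1] += repair_rate(i) * p[i]   # repair: i -> i-1
        return d

    start = [1.0] + [0.0] * (n + 1)                 # all components good at t = 0
    return rk4(deriv, start, t_end, steps)


def availability(n, k, lam, mu, c, t_end):
    """Sum of P_i(t) over the states with at most N-k failed components."""
    p = state_probabilities(n, lam, mu, c, t_end)
    return sum(p[: n - k + 1])


# Illustrative (assumed) parameters: a 2-out-of-3 system.
print(availability(n=3, k=2, lam=0.01, mu=0.1, c=0.95, t_end=100.0))
```

Calling `state_probabilities` with `mu=0.0` gives the pure birth process used for the reliability function, and `c=1.0` recovers the perfect fault-coverage case.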

Example 3.4. Consider a 1-out-of-3 system with imperfect fault-coverage. Suppose $\mu_i = i\cdot\mu$ for a multiple-repair facility, with the numerical values $\lambda = 10^{-6}$, $\mu = 10^{-5}$ and c = 0.95. The Markov model contains five states {0, 1, 2, 3, 4} as in Fig. 3.14, where state 4 is an absorbing failure state and state 3 is a non-absorbing failure state.

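For the availability function A(t), from the state transition rates, we have

$$X = \begin{bmatrix} s+3\lambda & -\mu & 0 & 0 \\ -3\lambda c & s+2\lambda+\mu & -2\mu & 0 \\ 0 & -2\lambda c & s+\lambda+2\mu & -3\mu \\ 0 & 0 & -\lambda c & s+3\mu \end{bmatrix}$$

The four real roots are obtained by solving $|X| = 0$ using a numerical method:

$$r_1 = 1.36932, \quad r_2 = 110.456, \quad r_3 = 219.544, \quad r_4 = 328.631$$

Finally, the state probabilities are obtained as

$$P_i(t) = \sum_{j=1}^{4}\frac{D_i(s)\big|_{s=-r_j}\exp(-r_j t)}{\prod_{z=1,\,z\ne j}^{4}(r_z - r_j)} \tag{3.30}$$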



where

$$D_0(s) = (s+2\lambda+\mu)\cdot[(s+\lambda+2\mu)\cdot(s+3\mu) - 3\lambda c\mu] - 4\lambda c\mu\cdot(s+3\mu)$$

$$D_1(s) = 3\lambda c\cdot[(s+\lambda+2\mu)\cdot(s+3\mu) - 3\lambda c\mu]$$

$$D_2(s) = 6(\lambda c)^2\cdot(s+3\mu)$$

$$D_3(s) = -6(\lambda c)^3$$

and

$$P_4(t) = 1 - P_0(t) - P_1(t) - P_2(t) - P_3(t) \tag{3.32}$$

The numerical results for the five state probabilities are plotted in Fig. 3.15, where $P_2(t)$ and $P_3(t)$ are almost 0.




Fig. 3.15. State probabilities for Example 3.4. 



The availability and the reliability functions can be obtained accordingly using Eqs. (3.28) and (3.29).

3.3. Load-Sharing Configurations

While components in parallel systems are designed to carry the full load, in load-sharing systems each component is designed to carry only part of the load. If one component fails in a load-sharing system, the remaining components share its load. Furthermore, since the surviving components now carry a heavier load, their failure rates increase due to the additional stress.



3.3.1. Two-component load-sharing system 

Consider a parallel load-sharing system consisting of two components. Under the 
load-sharing conditions, assume each component carries only one-half of the 
load. The following states can be identified: 

State 1: Two components are operational on a load-sharing basis. 

State 2: One component has failed, the other carries full load. 

State 3: Both components have failed, i.e., system failure. 

The state transition diagram is shown in Fig. 3.16. 



Fig. 3.16. State transition diagram for two-component load sharing system.

Here, the transition rate for the first transition is only one-half that for the full-load parallel system. The system reliability function is then given by

$$R(t) = \exp(-0.5\lambda t) + 2[1-\exp(-0.5\lambda t)]\exp(-\lambda t)$$

3.3.2. k-out-of-N load-sharing system

A load-sharing k-out-of-N system is a configuration that works if at least k out of N components are functioning and the surviving components share the total load. Shao & Lamberson (1991) studied such a system. The assumptions of the model are as follows:

1) The failure rate of all functioning components is the same, and when i components are functioning each of them has the constant failure rate $\lambda_i$, i = k, ..., N.

2) A failed component must be detected and disconnected by a controller, and the probability of success is $\alpha$. If the controller cannot detect and disconnect a failed unit, or the controller itself has failed, the system fails. The controller failure rate is a constant, denoted by $\lambda_c$.

3) At most r components can be in repair at one time, each with a repair rate $\mu$, so the repair rate when j components have failed is $\mu_j = \mu\min(j, r)$.

4) A repaired component is as good as new and is immediately reconnected to the system with negligible switch-over time.

5) The controller is never repaired or replaced during a mission.

Based on the above assumptions, the Markov model can be constructed as 
depicted by Fig. 3.17. 




Fig. 3.17. Markov transition diagram for load-sharing k-out-of-N system.

As shown in Fig. 3.17, the state space for the system is defined below:

State j (j = 0, 1, ..., N-k): j components have failed and have been disconnected from the network; the remaining (N-j) components and the controller are functioning.

State N-k+1: the system fails because only (k-1) components are functioning, but the system can return to working state (N-k) at repair rate $\mu_{N-k+1}$.

State F: the system fails because the controller cannot detect and disconnect a failed unit.

The Chapman-Kolmogorov equations can be given by

$$P_0'(t) = -(N\lambda_N + \lambda_c)P_0(t) + \mu_1 P_1(t)$$

$$P_j'(t) = -[(N-j)\lambda_{N-j} + \lambda_c + \mu_j]P_j(t) + \alpha(N-j+1)\lambda_{N-j+1}P_{j-1}(t) + \mu_{j+1}P_{j+1}(t), \quad j = 1, 2, \ldots, N-k$$

$$P_{N-k+1}'(t) = -\mu_{N-k+1}P_{N-k+1}(t) + \alpha k\lambda_k P_{N-k}(t)$$

$$P_F'(t) = \sum_{i=0}^{N-k}[(1-\alpha)(N-i)\lambda_{N-i} + \lambda_c]P_i(t) \tag{3.33}$$

The initial conditions are:

$$P_i(0) = \begin{cases} 1 & (i = 0) \\ 0 & (i \ne 0) \end{cases} \tag{3.34}$$

These equations can be solved numerically. The system availability and reliability functions can then be obtained accordingly.



Example 3.5. Consider the jet engines of a commercial airplane. Two functioning jet engines are required for flying, but four engines are used for full power. An engine controller manages the load-sharing. When four engines function in the airplane, the load on each is much less than when they function alone. From the test data, if four engines are functioning, the failure rate of each engine is reduced to 50% of its full-load value, while if three engines are functioning the failure rate is reduced to 60%, and with two engines to 70%. The switching probability is $\alpha = 0.99$, the jet engine failure rate under full load is $\lambda = 0.1$, the controller failure rate is $\lambda_c = 0.01$, and the repair rate is $\mu_i = 0.2$ (i = 1, 2, 3).

The above jet engine system is a 2-out-of-4 load-sharing system. Its CTMC can be modeled as in Fig. 3.18.



Fig. 3.18. Markov model for the 2-out-of-4 load-sharing system (failure transition rates 0.198, 0.1782 and 0.1386).



The Chapman-Kolmogorov equations can then be constructed. It is then possible to solve for all the state probability functions $P_i(t)$ and obtain the system availability function A(t).
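The numerical solution mentioned for Example 3.5 can be sketched as follows. The code builds the five-state chain (states 0 to 3 plus the absorbing state F) from the parameters of the example and integrates it with a Runge-Kutta step; the helper names, the state indexing and the step count are illustrative assumptions.

```python
N, K = 4, 2                       # 2-out-of-4 system
ALPHA, LAM_FULL, LAM_C, MU = 0.99, 0.1, 0.01, 0.2
# Per-engine failure rate when i engines share the load (50%, 60%, 70% of full load).
LAM = {4: 0.5 * LAM_FULL, 3: 0.6 * LAM_FULL, 2: 0.7 * LAM_FULL}
F = N - K + 2                     # list index used for the absorbing state F


def covered_failure_rate(j):
    """Rate of the transition j -> j+1 (a detected and disconnected failure)."""
    working = N - j
    return ALPHA * working * LAM[working]


def deriv(p):
    """Right-hand side of the state equations for p = [P_0, P_1, P_2, P_3, P_F]."""
    d = [0.0] * (F + 1)
    for j in range(N - K + 1):                  # operational states 0..N-k
        working = N - j
        fail = working * LAM[working]
        d[j] -= (fail + LAM_C) * p[j]
        d[j + 1] += ALPHA * fail * p[j]
        d[F] += ((1.0 - ALPHA) * fail + LAM_C) * p[j]
    for j in range(1, N - K + 2):               # repair from states 1..N-k+1
        d[j] -= MU * p[j]
        d[j - 1] += MU * p[j]
    return d


def solve(t_end, steps=2000):
    """Integrate from P_0(0) = 1 with the classical RK4 rule."""
    p = [1.0] + [0.0] * F
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = deriv([x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * g + d)
             for x, a, b, g, d in zip(p, k1, k2, k3, k4)]
    return p


def availability(t_end):
    """Probability of being in one of the operational states 0..N-k."""
    return sum(solve(t_end)[: N - K + 1])


print([round(covered_failure_rate(j), 4) for j in range(3)])
print(availability(10.0))
```

The first printed line gives the three detected-failure transition rates, which can be compared with the rates shown on the Markov diagram of the example.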



3.4. Standby Configurations 

Standby redundancy is particularly important in applications where low power consumption is mandatory, such as spacecraft systems. Standby systems also yield better reliability than can be achieved using the same quantity of equipment in parallel mode. This holds when the failure rate in the standby condition is assumed to be zero. If this assumption does not apply, the model needs to be modified to account for the storage failure rate. Moreover, the switch and monitor that control the system may themselves fail, which is also considered in this section. Finally, multi-mode operation of standby redundancies is discussed as well.



3.4.1. Standby with zero storage failure rate 

Usually standby components can be assumed to have zero or very low failure rate 
in storage. If this is the case, then we have a simple system consisting of only two 
components, a primary and a standby spare, as shown in Fig. 3.19. The spare is 
passive until switched in. 




Fig. 3.19. Standby configuration. 



Both components are assumed to have the same failure rate, $\lambda$, when operating. In the standby mode, the failure rate is zero (i.e. cold standby). Since only one of these components is used at a given time, we identify the following states:

State 1: Primary component is operational.

State 2: Standby component has been switched in and is operational.

State 3: Both components have failed, i.e., system failure.

The transition rate from state 1 to state 2 and that from state 2 to state 3 are both equal to $\lambda$. The reliability function can be obtained as

$$R(t) = \exp(-\lambda t) + [1-\exp(-\lambda t)]\exp(-\lambda t)$$

The same approach can be extended to standby systems where there are (N-1) cold standby components together with one primary component. The state transition diagram is depicted in Fig. 3.20, where state N+1 is the system failure state.




Fig. 3.20. CTMC for N-1 cold standby components together with one primary.

The reliability function can be obtained as

$$R(t) = \sum_{i=1}^{N}[1-\exp(-\lambda t)]^{i-1}\exp(-\lambda t) \tag{3.35}$$

Example 3.6. A system contains two cold standby components and one primary component, each of which has the failure rate $\lambda = 0.02$. Then, its reliability function is computed as

$$R(t) = e^{-0.02t} + [1-e^{-0.02t}]e^{-0.02t} + [1-e^{-0.02t}]^2 e^{-0.02t}$$

and the curve is shown in Fig. 3.21.

Fig. 3.21. Reliability function for two cold standby components and one primary component.
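The reliability function of Example 3.6 is easy to tabulate. The sketch below evaluates the sum of Eq. (3.35) as printed, for a general number of components; the function name and the chosen time points are illustrative.

```python
import math


def cold_standby_reliability(t, lam, n):
    """Eq. (3.35) as printed: one primary plus n-1 cold standby components."""
    x = math.exp(-lam * t)
    return sum((1.0 - x) ** (i - 1) * x for i in range(1, n + 1))


# Example 3.6: two cold standby components and one primary, lambda = 0.02.
for t in (10, 50, 100):
    print(t, round(cold_standby_reliability(t, lam=0.02, n=3), 4))
```

Evaluating the function on a fine grid of t values gives the points needed to redraw the reliability curve of the example.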



3.4.2. Standby with nonzero storage failure rate 

If the standby component has a nonzero failure rate $\lambda_s$ (for example, because it is energized as a warm or hot standby; see, e.g., Pukite & Pukite (1998, pp. 73-80)), then we can identify these states:

State 1 : Both components are good, primary component is operating. 

State 2: Primary component has failed; secondary has been switched on and is 
operational. 

State 3: Standby component has failed; system is still operating with primary 
component. 

State 4: Both components have failed; system failure. 

The state transition diagram is depicted in Fig. 3.22.

Fig. 3.22. Standby with nonzero storage failure rate.

After merging states 2 and 3 (adding $P_3$ into $P_2$), we can easily obtain the reliability function as

$$R(t) = \exp\{-(\lambda+\lambda_s)t\} + [1-\exp\{-(\lambda+\lambda_s)t\}]\exp(-\lambda t)$$

3.4.3. Imperfect monitor and switch 

In the models described so far, the monitors and switches are assumed to be 
perfectly operational. In this section, we include the effects of imperfect monitor 
and switch, i.e. we consider the failure of the fault monitor and switch, 
respectively. 

The conventional failure monitor and switch can fail in one of two modes:

1. In a state where the failure monitoring ability is disabled.

2. In a state where a false switching to the next standby component has occurred.

If we assume equal failure rates for the primary and secondary components and initially ignore the component storage failure rates, then by assigning $\lambda_{s1}$ and $\lambda_{s2}$ to the monitor and switch failure rates for the two modes described above, the system states will be:

State 1: Primary component, fault monitor, and switch are in operational 
condition. 

State 2: Primary component is operating, but the switch has failed.

State 3: Secondary component is operating.

State 4: System failure.

The state transition diagram is shown in Fig. 3.23.

Fig. 3.23. Standby configuration with imperfect monitor and switch.

Note that states 2 and 3 are equivalent, in that there is only one operational component left. We can therefore reduce the number of system states and simplify the state diagram. The final reliability function is given by

$$R(t) = \exp\{-(\lambda+\lambda_{s1}+\lambda_{s2})t\} + [1-\exp\{-(\lambda+\lambda_{s1}+\lambda_{s2})t\}]\exp(-\lambda t)$$

It is possible to extend the same concept to system configurations with more 
than one standby component by viewing the imperfect monitor/switch as another 
parallel component. 



3.4.4. Multiple mode operation 

Many standby systems are designed for multiple mode operation. Chen & Bastani 
(1992) constructed a CTMC to evaluate the reliability of multiple mode operation 
system with both full and partial redundancies. The assumptions in their model 
are given below: 

1) Failure times of components are exponentially distributed with a constant failure rate.

2) A full redundancy requires the full power of a component and serves as either a primary component or a hot standby of that component.

3) A partial redundancy requires part of the processing power of the full primary component and serves as a warm standby of that component.

Suppose a system has a full redundancy or primary with failure rate $\lambda_1$, and a partial redundancy with failure rate $\lambda_2$. Using the Markov model for this system, the reliability function can be obtained as

$$R(t) = \exp(-\lambda_1 t) + [1-\exp(-\lambda_1 t)]\exp(-\lambda_2 t) \tag{3.36}$$



This Markov model can also be extended to a system with N modes of operation. Suppose a system has one primary component with failure rate $\lambda_1$ and N-1 partial redundancies with failure rates ($\lambda_2, \ldots, \lambda_N$). The reliability function can be obtained as

$$R(t) = \exp(-\lambda_1 t) + \sum_{i=2}^{N}\Big\{\exp(-\lambda_i t)\prod_{j=1}^{i-1}[1-\exp(-\lambda_j t)]\Big\} \tag{3.37}$$



Example 3.7. Suppose a system contains one primary with failure rate 0.03 and two partial redundancies with the same failure rate of 0.05. Substituting $\lambda_1 = 0.03$ and $\lambda_2 = \lambda_3 = 0.05$ into Eq. (3.37), we have

$$R(t) = \exp(-0.03t) + [1-\exp(-0.03t)]\exp(-0.05t) + [1-\exp(-0.03t)][1-\exp(-0.05t)]\exp(-0.05t)$$

The curve of the reliability function is shown in Fig. 3.24.
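The general N-mode expression and the expanded three-term form of Example 3.7 can be checked against each other numerically. The sketch below implements both; the function names are illustrative, and the list `rates` is assumed to hold the primary failure rate first, followed by the partial redundancies in switching order.

```python
import math


def multi_mode_reliability(t, rates):
    """Eq. (3.37): rates[0] is the primary, rates[1:] the partial redundancies."""
    total = 0.0
    all_previous_failed = 1.0          # product of [1 - exp(-lambda_j t)] for j < i
    for lam in rates:
        total += math.exp(-lam * t) * all_previous_failed
        all_previous_failed *= 1.0 - math.exp(-lam * t)
    return total


def example_3_7(t):
    """The expanded three-term expression given in Example 3.7."""
    a = math.exp(-0.03 * t)
    b = math.exp(-0.05 * t)
    return a + (1 - a) * b + (1 - a) * (1 - b) * b


for t in (10, 50, 100):
    print(t, multi_mode_reliability(t, [0.03, 0.05, 0.05]), example_3_7(t))
```

With a two-element list the same function reduces to the single-redundancy case of Eq. (3.36).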

If we further consider a system with two partial redundancies having a nonzero storage failure rate, say 0.01, the Markov model is constructed as the CTMC in Fig. 3.25.

Fig. 3.24. The reliability function in Example 3.7.

Fig. 3.25. Markov model for multi-mode operation with nonzero storage failures.



At state 1, all three components are functioning. At state 2, the primary has failed and the other two are functioning. At state 3, the primary and one redundancy have failed while the other is functioning. At state 4, the primary and one redundancy are functioning but the other redundancy has failed due to a storage failure. At state 5, the primary is functioning while both redundancies have failed due to storage failures. Finally, at state F, all three components have failed and the system fails.

The Chapman-Kolmogorov equations can be written as

$$P_1'(t) = -0.05P_1(t)$$

$$P_2'(t) = 0.03P_1(t) - 0.06P_2(t)$$

$$P_3'(t) = 0.06P_2(t) + 0.03P_4(t) - 0.05P_3(t)$$

$$P_4'(t) = 0.02P_1(t) - 0.04P_4(t)$$

$$P_5'(t) = 0.01P_4(t) - 0.03P_5(t)$$

$$P_F'(t) = 0.05P_3(t) + 0.03P_5(t)$$

With the initial conditions $P_1(0) = 1$ and all other probabilities equal to 0, the system reliability function can be obtained as:

$$R(t) = 1 - P_F(t)$$
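The six equations above form a small linear system that is straightforward to integrate numerically. The sketch below encodes the transitions read off those equations and computes R(t) with a Runge-Kutta step; the transition list layout, the index 6 used for state F and the step count are illustrative assumptions.

```python
import math

# (from_state, to_state, rate) read off the Chapman-Kolmogorov equations;
# states 1..5 keep their numbers and the failure state F is stored at index 6.
TRANSITIONS = [
    (1, 2, 0.03), (1, 4, 0.02),
    (2, 3, 0.06),
    (3, 6, 0.05),
    (4, 3, 0.03), (4, 5, 0.01),
    (5, 6, 0.03),
]


def deriv(p):
    """Net probability flow into each state (index 0 is unused)."""
    d = [0.0] * 7
    for src, dst, rate in TRANSITIONS:
        flow = rate * p[src]
        d[src] -= flow
        d[dst] += flow
    return d


def solve(t_end, steps=4000):
    """Integrate from P_1(0) = 1 with the classical RK4 rule."""
    p = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = deriv([x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * g + d)
             for x, a, b, g, d in zip(p, k1, k2, k3, k4)]
    return p


def reliability(t_end):
    """R(t) = 1 - P_F(t)."""
    return 1.0 - solve(t_end)[6]


for t in (10.0, 50.0, 100.0):
    print(t, reliability(t))
```

Because the first equation has the known solution exp(-0.05t), comparing it with the computed value of the first state probability is a convenient accuracy check.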



3.5. Notes and References 

For reliability of hardware systems, Pukite & Pukite (1998) summarized some 
common configurations and implemented simple Markov models. Other than the 
Markov models, Elsayed (1996) described many other models that are commonly 
used in reliability engineering. There are also many general texts on reliability 
engineering and most of them deal with models for hardware systems. 

Bobbio et al. (1980) first used Markov models in the study of a single hardware component that may contain multiple failure modes. Recently, Levitin et al. (1998) introduced a method called UGF (Universal Generating Function) for dealing with multiple failure modes. Alexopoulos & Shultes (2001) presented a method using an importance-sampling plan that dynamically adjusts the transition probabilities of the embedded Markov chain by attempting to cancel terms of the likelihood ratio within each cycle.

Kuo & Zuo (2003) summarized the reliability modeling for k-out-of-N configurations and presented optimization schedules for improving the system reliability. Arulmozhi (2003) further presented a simple and efficient computational method for determining the reliability of a k-out-of-N system whose components are heterogeneous. For parallel configurations, besides the majority voting and k-out-of-N voting introduced in this chapter, there are many other voting schemes, such as the enhanced voting scheme (Ammann & Knight, 1988), the weighted voting scheme (Levitin, 2001) and so on. Latif-Shabgahi et al. (2000) summarized various voting schemes for different fault tolerant systems. Chang et al. (2000) provided an extensive coverage of consecutive-k-out-of-n systems.

For the standby configurations, Sherwin & Bossche (1993) summarized the 
reliability analysis for both hot (active) standby and cold standby systems. Later, 
Chen et al. (1994) studied the reliability of a warm standby system which is an 
intermediate case between the hot and cold standby. Recently, Zhao & Liu (2003) 
provided a unified modeling idea for both parallel and standby redundancy 
optimization problems based on the system reliability analysis. 





CHAPTER 4

MODELS FOR SOFTWARE RELIABILITY



Software is an important element in computing systems. Unlike hardware, software does not wear out and it can be easily reproduced. Furthermore, software systems are usually debugged during the testing phase, so that their reliability improves over time as a result of detecting and removing software faults. Many software reliability growth models have been proposed for the study of software reliability, e.g. Xie (1991), Lyu (1996) and Pham (2000).

Markov models are one of the first types of models proposed in software 
reliability analysis. This chapter mainly summarizes models of this type. In 
addition, Nonhomogeneous Poisson Process (NHPP) models, which are 
important in software reliability analysis, are also discussed in this chapter. 



4.1. Basic Markov Model

The basic Markov model in software reliability is the model originally developed by Jelinski & Moranda (1972). It is one of the earliest models, and many later Markov models can be considered as modifications or extensions of it.

4.1.1. Model description

The underlying assumptions of the Jelinski-Moranda (JM) model are:

1) The number of initial software faults is an unknown but fixed constant.

2) A detected fault is removed immediately and no new faults are introduced.

3) Times between failures are independent, exponentially distributed random variables.

4) All remaining faults in the software contribute the same amount to the software failure rate.

The initial number of faults in the software before testing starts is denoted by $N_0$. From assumptions (3) and (4), the initial failure rate is then equal to $N_0\phi$, where $\phi$ is a constant of proportionality denoting the failure rate contributed by each fault. It follows from assumption (2) that, after a new fault is detected and removed, the number of remaining faults is decreased by one. Hence after the i:th failure, there are $N_0 - i$ faults left, and the failure rate decreases to $\phi(N_0 - i)$. This Markov process is depicted in Fig. 4.1, where state k means that there are k faults left in the software.




Fig. 4.1. Markov process of Jelinski-Moranda model. 



The i:th failure-free period, i.e., the time between the (i-1):st and the i:th failure, is denoted by $T_i$, $i = 1, 2, \ldots, N_0$. By the assumptions, the $T_i$'s are exponentially distributed random variables with parameter

$$\lambda(i) = \phi[N_0 - (i-1)] = \phi(N_0 - i + 1), \quad i = 1, 2, \ldots, N_0 \tag{4.1}$$

The probability density function of $T_i$ is given by

$$f(t_i) = \phi(N_0 - i + 1)\exp\{-\phi(N_0 - i + 1)t_i\}, \quad i = 1, 2, \ldots, N_0 \tag{4.2}$$

The main property of the JM-model is that the failure rate is constant between the detection of two consecutive failures. This is reasonable if the software is unchanged and the testing is random and homogeneous.



4.1.2. Parameter estimation 

The parameters of the JM-model may easily be estimated by using the method of maximum likelihood. Let $t_i$ denote the observed i:th failure-free time interval during the testing phase. The number of faults detected is denoted here by n, which will be called the sample size. If a failure time data set $\{t_1, t_2, \ldots, t_n;\ n > 0\}$ is given, the parameters $\phi$ and $N_0$ in the JM-model can be estimated by maximizing the likelihood function.

The likelihood function of the parameters $\phi$ and $N_0$ is given by

$$L(t_1, t_2, \ldots, t_n; N_0, \phi) = \prod_{i=1}^{n}\phi(N_0-i+1)\exp\{-\phi(N_0-i+1)t_i\} = \phi^n\prod_{i=1}^{n}(N_0-i+1)\exp\Big\{-\phi\sum_{i=1}^{n}(N_0-i+1)t_i\Big\} \tag{4.3}$$

The natural logarithm of the above likelihood function is

$$\ln L = \ln\Big[\phi^n\prod_{i=1}^{n}(N_0-i+1)\exp\Big\{-\phi\sum_{i=1}^{n}(N_0-i+1)t_i\Big\}\Big] = n\ln\phi + \sum_{i=1}^{n}\ln(N_0-i+1) - \phi\sum_{i=1}^{n}(N_0-i+1)t_i \tag{4.4}$$

By taking the partial derivatives of this log-likelihood function with respect to $N_0$ and $\phi$, respectively, and equating them to zero, the following likelihood equations can be obtained:

$$\frac{\partial\ln L}{\partial N_0} = \sum_{i=1}^{n}\frac{1}{N_0-i+1} - \phi\sum_{i=1}^{n}t_i = 0 \tag{4.5}$$

and

$$\frac{\partial\ln L}{\partial\phi} = \frac{n}{\phi} - \sum_{i=1}^{n}(N_0-i+1)t_i = 0 \tag{4.6}$$

By solving for $\phi$ from Eq. (4.6), we get

$$\phi = \frac{n}{\sum_{i=1}^{n}(N_0-i+1)t_i} \tag{4.7}$$

and by inserting this into Eq. (4.5), we obtain an equation independent of $\phi$ as

$$\frac{1}{N_0} + \frac{1}{N_0-1} + \cdots + \frac{1}{N_0-n+1} = \frac{n\sum_{i=1}^{n}t_i}{\sum_{i=1}^{n}(N_0-i+1)t_i} \tag{4.8}$$

An estimate of $N_0$ can then be obtained by solving this equation. Inserting the estimated value into Eq. (4.7), we obtain a maximum likelihood estimate (MLE) of $\phi$.



Example 4.1. Suppose that a software product is being tested by a group. Each 
time a failure is observed, the fault causing the failure is removed. The 30 test 
data of time between failures are recorded in Table 4.1. 

Substituting the data of Table 4.1 into the likelihood equations and solving
them, we obtain $N_0 = 54$ and $\phi = 0.00077$.

After the 30th failure, the estimated number of remaining faults is
$k = 54 - 30 = 24$ and the failure rate at that time is

$$
\lambda = k \cdot \phi = 24 \times 0.00077 \approx 0.0185
$$

Table 4.1. A set of failure data.

Failure  Time between    Failure  Time between    Failure  Time between
number   failures        number   failures        number   failures
   1        14.75          11         7.99          21        64.25
   2        43.99          12        28.09          22        40.90
   3         9.89          13        11.80          23         3.07
   4         0.07          14         1.78          24         0.75
   5         5.70          15        12.50          25        13.36
   6         7.89          16        73.08          26        23.02
   7        28.79          17        42.60          27       143.31
   8       170.15          18         9.18          28        55.46
   9        26.83          19        49.43          29        75.57
  10        36.15          20         9.19          30        34.31

The estimated reliability function after the 30th failure is

$$
R(t) = \exp(-0.0185\,t)
$$

and the MTTF after the 30th failure is estimated as

$$
\text{MTTF} = \frac{1}{\lambda} \approx 54.1
$$
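The estimation procedure of Eqs. (4.7) and (4.8) can be sketched numerically. The following is a minimal sketch, not a definitive implementation: it assumes that Eq. (4.8) is solved by bisection on a real-valued $N_0 > n - 1$, and the function name `jm_mle`, the search limit `upper` and the tolerance are illustrative choices that do not come from the text. Applied to the data of Table 4.1, it should give estimates close to those quoted in this example.

```python
def jm_mle(times, upper=1e6, tol=1e-9):
    """Return (N0_hat, phi_hat) for the JM-model from times between failures.

    Hypothetical helper: solves Eq. (4.8) for N0 by bisection, then uses Eq. (4.7).
    """
    n = len(times)
    total = sum(times)
    # Sum of (i - 1) * t_i with 1-based i, so that
    # sum_i (N0 - i + 1) * t_i = N0 * total - weighted.
    weighted = sum(i * t for i, t in enumerate(times))

    def g(n0):
        # Left side minus right side of Eq. (4.8).
        lhs = sum(1.0 / (n0 - i) for i in range(n))
        return lhs - n * total / (n0 * total - weighted)

    lo, hi = n - 1 + 1e-6, upper  # g(lo) is large and positive
    if g(hi) >= 0:
        raise ValueError("no finite root below `upper`; the data show no reliability growth")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    n0 = 0.5 * (lo + hi)
    phi = n / (n0 * total - weighted)  # Eq. (4.7)
    return n0, phi


# Times between failures from Table 4.1, in failure order 1..30.
TABLE_4_1 = [
    14.75, 43.99, 9.89, 0.07, 5.70, 7.89, 28.79, 170.15, 26.83, 36.15,
    7.99, 28.09, 11.80, 1.78, 12.50, 73.08, 42.60, 9.18, 49.43, 9.19,
    64.25, 40.90, 3.07, 0.75, 13.36, 23.02, 143.31, 55.46, 75.57, 34.31,
]

n0_hat, phi_hat = jm_mle(TABLE_4_1)
print(f"N0 estimate: {n0_hat:.2f}, phi estimate: {phi_hat:.5f}")
```

The solver treats $N_0$ as a continuous quantity; in practice the result is rounded to an integer, and the `ValueError` branch covers data sets for which Eq. (4.8) has no finite solution.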



Note that the estimate of the number of initial faults might be unreasonable.
Usually more failure data should be accumulated for the estimate to be accurate.

4.2. Extended Markov Models

In many cases, the basic Markov model (JM-model) is not accurate enough,
because several of its assumptions may not be realistic. For example, software faults are
not all of the same size, in the sense that some affect more input data than others do,
and some faults are easier to detect than others. Many extended models,
which relax some assumptions of the JM-model, have been proposed (Xie, 1991). Some
of them are discussed in this section.



4.2.1. Proportional models 

Moranda (1979) presented an extended Markov model whose basic assumptions
are the same as those of the JM-model, except that the (i+1):st failure rate is
assumed to be proportional to the i:th failure rate, i.e.

$$
\lambda_{i+1} = C_i \lambda_i, \quad i = 0, 1, 2, \ldots \tag{4.9}
$$

This Markov process can be depicted by the Markov chain shown in Fig. 4.2,
where state i represents that i failures have occurred.




Fig. 4.2. The Markov chain for the proportional model. 



This kind of model is called a proportional model in Gaudoin et al. (1994). The
idea is that the difference between two successive failure rates is due
only to the debugging, and practical constraints lead us to believe that the effect
of this debugging is multiplicative. A proportional model is completely defined
by the distributions of $\lambda_1$ and $C = \{C_1, C_2, C_3, \ldots\}$.

Deterministic proportional model

In the simplest proportional model, all random variables are deterministic, i.e.,
$\lambda_1$ and $C$ are constant. Hence it is called the Deterministic Proportional Model, as
defined below.

Definition 4.1. The Deterministic Proportional Model, with parameters $\lambda$
and $\theta$, is the software reliability model in which the random variables $T_i$ are
independent and exponentially distributed with parameter

$$
\lambda_i = \lambda \exp\{-(i - 1)\theta\}, \quad i \ge 1 \tag{4.10}
$$



This model was originally suggested by Moranda (1979) as the geometric
de-eutrophication model. Its detailed statistical properties were studied by Gaudoin
& Soler (1992), and we summarize some results here.

For the sake of convenience, let $C = \exp(-\theta)$, where $\theta$ is a real number.
In fact, $\theta$ represents the quality of the debugging. If no debugging is done at all
($\theta = 0$), the failure rate remains constant; if the debugging is successful ($\theta > 0$),
the failure rate decreases and the reliability grows. The parameter $\lambda$ is
a scale parameter, and it is given by

$$
\lambda = 1 / E\{T_1\} \tag{4.11}
$$

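The likelihood for the observation of the first n times between failures $t_i$
(i = 1, 2, ..., n), with $\lambda_i = \lambda \exp\{-(i - 1)\theta\}$, is

$$
L(\lambda, \theta) = \prod_{i=1}^{n} \lambda_i \exp(-\lambda_i t_i)
= \lambda^n \exp\Big\{-\Big[\frac{n(n - 1)}{2}\theta + \lambda \sum_{i=1}^{n} \exp\{-(i - 1)\theta\}\, t_i\Big]\Big\} \tag{4.12}
$$

Consequently, the maximum likelihood estimates of $\theta$ and $\lambda$, $\hat{\theta}(t_1, \ldots, t_n)$ and
$\hat{\lambda}(t_1, \ldots, t_n)$, are the solution of the following equations:

$$
\frac{n}{\hat{\lambda}} = \sum_{i=1}^{n} t_i \exp\{-(i - 1)\hat{\theta}\} \tag{4.13}
$$

and

$$
\sum_{i=1}^{n} (n - 2i + 1)\, t_i \exp\{-(i - 1)\hat{\theta}\} = 0 \tag{4.14}
$$

Equation (4.14) expresses that $c = \exp(-\hat{\theta})$ is a root of a polynomial of degree
n-1, i.e.

$$
\sum_{i=1}^{n} (n - 2i + 1)\, t_i\, c^{\,i-1} = 0 \tag{4.15}
$$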
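As an illustration of how Eqs. (4.13) and (4.14) might be solved in practice, the sketch below finds $\theta$ by bisection and then computes $\lambda$. It is a sketch under stated assumptions: the function name `dpm_mle` and the search bracket `[lo, hi]` for $\theta$ are hypothetical choices, and all observed times are assumed positive, in which case the left side of Eq. (4.14) changes sign exactly once.

```python
import math


def dpm_mle(times, lo=-2.0, hi=2.0, tol=1e-12):
    """Return (lambda_hat, theta_hat) for the Deterministic Proportional Model.

    Hypothetical helper: bisection on Eq. (4.14), then Eq. (4.13).
    """
    n = len(times)

    def h(theta):
        # Left side of Eq. (4.14); k = i - 1, so (n - 2i + 1) = (n - 2k - 1).
        return sum((n - 2 * k - 1) * t * math.exp(-k * theta)
                   for k, t in enumerate(times))

    # h is negative for very negative theta and positive for large theta.
    if h(lo) >= 0 or h(hi) <= 0:
        raise ValueError("root of Eq. (4.14) is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    lam = n / sum(t * math.exp(-k * theta) for k, t in enumerate(times))  # Eq. (4.13)
    return lam, theta


# Noise-free check: t_i = 2 * exp(0.05 * (i - 1)) corresponds to lambda = 0.5, theta = 0.05.
synthetic = [2.0 * math.exp(0.05 * k) for k in range(10)]
print(dpm_mle(synthetic))
```

With noise-free data generated from known parameters, the estimates recover those parameters, which gives a quick sanity check before applying the routine to real failure data.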



Example 4.2. Suppose a software system is tested by a group. The 30 test data of 
time between failures are recorded in Table 4.2. 



Table 4.2. 30 test data for time between failures.

Failure  Time between    Failure  Time between    Failure  Time between
number   failures        number   failures        number   failures
   1        18.45          11         0.23          21        16.92
   2        16.88          12        94.57          22         7.62
   3         0.26          13        41.19          23        19.10
   4        14.63          14         1.41          24         5.98
   5         0.49          15         2.49          25        87.33
   6        23.08          16        49.02          26         7.23
   7        34.17          17        21.88          27        16.08
   8         0.56          18        15.73          28        21.80
   9         7.92          19        21.18          29       103.30
  10         6.86          20        64.94          30         3.35

To estimate the parameters $\lambda$ and $\theta$ of the Deterministic Proportional
Model, substitute the data of Table 4.2 into Eqs. (4.13) and (4.14), and solve them
to obtain $\theta = 0.038$ and $\lambda = 0.0754$.

The mean time to failure (MTTF) after the 30th failure, $E(T_{31})$, is

$$
\text{MTTF} = E(T_{31}) = \frac{1}{0.0754 \exp\{-(31 - 1) \times 0.038\}} = 41.47 \text{ (hours)}
$$

If the customers require that the MTTF of the software product be no less than
70 hours, i.e., $E(T_i) \ge 70$, then

$$
E(T_i) = \frac{1}{0.0754 \exp\{-0.038 (i - 1)\}} \ge 70
$$

Solving this, we get $i \ge 44.7$, so that the number of removed faults needs to be at
least 45. That is, at least 45 - 30 = 15 more faults need to be removed.

The expected time for detecting and removing the additional 15 faults is

$$
\sum_{i=31}^{45} \frac{1}{0.0754 \exp\{-0.038 (i - 1)\}} = 822.6 \text{ (hours)}
$$

This is the estimated additional testing time needed.
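The arithmetic of Example 4.2 can be checked with a few lines of code. The sketch below takes the quoted estimates $\lambda = 0.0754$ and $\theta = 0.038$ as given; the function names `mttf`, `first_index_meeting` and `extra_test_time` are illustrative only.

```python
import math

LAM, THETA = 0.0754, 0.038  # estimates quoted in Example 4.2


def mttf(i, lam=LAM, theta=THETA):
    """E(T_i) = 1 / (lam * exp(-(i - 1) * theta))."""
    return math.exp((i - 1) * theta) / lam


def first_index_meeting(target, lam=LAM, theta=THETA):
    """Smallest i with E(T_i) >= target (assumes theta > 0)."""
    return max(1, math.ceil(1 + math.log(target * lam) / theta))


def extra_test_time(start, stop, lam=LAM, theta=THETA):
    """Sum of E(T_i) for i = start, ..., stop."""
    return sum(mttf(i, lam, theta) for i in range(start, stop + 1))


print(round(mttf(31), 2))                  # MTTF after the 30th failure
print(first_index_meeting(70))             # first i with E(T_i) >= 70
print(round(extra_test_time(31, 45), 1))   # expected additional testing time
```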



Lognormal proportional model 

In fact, the assumption of the Deterministic Proportional Model that $C_i$ (the mean
quality) is constant is not realistic. A more realistic assumption would be that the
mean qualities of the successive debugging actions are independent random variables $\Theta_i$
with a common normal distribution. Then,

$$
C_i = \exp(-\Theta_i)
$$

has a lognormal distribution. Gaudoin et al. (1994) presented a lognormal
proportional model with

$$
\lambda_{i+1} = \exp(-\Theta_i)\, \lambda_i \tag{4.16}
$$

in which $\Theta_i$ is normally distributed with mean $\theta$ and standard deviation $\sigma$.
The mean and variance of $T_i$ are derived by Gaudoin et al. (1994) as

$$
E\{T_i\} = \frac{1}{\lambda} \exp\Big\{(i - 1)\Big(\theta + \frac{\sigma^2}{2}\Big)\Big\}
$$

and

$$
\mathrm{Var}\{T_i\} = \frac{1}{\lambda^2} \exp\{2(i - 1)(\theta + 0.5\sigma^2)\}\, \big[2 \exp\{(i - 1)\sigma^2\} - 1\big]
$$
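The two moment formulas of the lognormal proportional model are easy to evaluate directly. The following minimal sketch (the function name `lognormal_pm_moments` is hypothetical) computes both; setting $\sigma = 0$ recovers the deterministic proportional model, for which $T_i$ is exponential and its variance equals the square of its mean.

```python
import math


def lognormal_pm_moments(i, lam, theta, sigma):
    """Return (mean, variance) of T_i under the lognormal proportional model."""
    k = i - 1
    growth = math.exp(k * (theta + 0.5 * sigma ** 2))
    mean = growth / lam
    var = (growth / lam) ** 2 * (2.0 * math.exp(k * sigma ** 2) - 1.0)
    return mean, var


# sigma = 0 reduces to the deterministic case: variance == mean ** 2.
mean_det, var_det = lognormal_pm_moments(31, 0.0754, 0.038, 0.0)
# A positive sigma increases both the mean and the variance.
mean_ln, var_ln = lognormal_pm_moments(31, 0.0754, 0.038, 0.1)
print(mean_det, var_det, mean_ln, var_ln)
```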



4.2.2. DFI (Decreasing Failure Intensity) model 

A serious critique of the JM-model is that not all software faults contribute the
same amount to the failure rate. Some generalizations and modifications of the
JM-model are presented in Xie (1987). We briefly describe this general
formulation together with some special cases in this section.



General DFI formulation 

The JM-model can be modified by using another function for $\lambda(i)$. Note that $\lambda(i)$
is defined as the rate of occurrence of the next failure after the removal of i - 1
faults. The failure intensity is DFI (Decreasing Failure Intensity) if $\lambda(i)$ is a
decreasing function of i. A DFI model is thus a Markov counting process model
with decreasing failure intensity.

Under the general assumptions above, the cumulative number of faults
detected and removed, $\{N(t), t \ge 0\}$, is a Markov process with decreasing failure
rate $\lambda(i)$. The theory for CTMC can be applied.

If $P_i(t) = P\{N(t) = i\}$, $i = 0, 1, \ldots, N_0$, the Chapman-Kolmogorov equations
are given as

$$
\begin{aligned}
P_0'(t) &= -\lambda(1) P_0(t) \\
P_i'(t) &= -\lambda(i + 1) P_i(t) + \lambda(i) P_{i-1}(t), \quad i = 1, 2, \ldots, N_0 - 1 \\
P_{N_0}'(t) &= \lambda(N_0) P_{N_0 - 1}(t)
\end{aligned} \tag{4.17}
$$

with the initial conditions

$$
P_0(0) = 1 \quad \text{and} \quad P_i(0) = 0 \ \text{for } i > 0
$$

The above equations can easily be solved and the solution is as follows (Xie,
1991),

$$
P_0(t) = \exp\{-\lambda(1) t\}
$$

$$
P_1(t) = -\frac{\lambda(1)}{\lambda(2) - \lambda(1)} (e_1 - e_0)
$$

$$
P_i(t) = \sum_{j=0}^{i} A_j^{(i)} e_j, \quad i = 2, 3, \ldots, N_0 - 1
$$

and for $i = N_0$, we have

$$
P_{N_0}(t) = 1 - \sum_{j=0}^{N_0 - 1} A_j^{(N_0 - 1)} \frac{\lambda(N_0)}{\lambda(j + 1)}\, e_j
$$

where the quantities $e_j$, $j = 0, 1, \ldots, N_0 - 1$, are defined as

$$
e_j = \exp\{-\lambda(j + 1)\, t\}, \quad j = 0, 1, \ldots, N_0 - 1
$$

and $A_j^{(i)}$ can be calculated recursively, starting from $A_0^{(0)} = 1$, through

$$
A_j^{(i)} = \frac{\lambda(i)}{\lambda(i + 1) - \lambda(j + 1)}\, A_j^{(i-1)}, \quad j < i
$$

$$
A_i^{(i)} = -\sum_{j=0}^{i-1} A_j^{(i)}
$$
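The recursive solution of the DFI model can be sketched in code. The sketch below assumes the reading of the recursion in which $P_i(t) = \sum_{j=0}^{i} A_j^{(i)} e_j$ with $A_0^{(0)} = 1$, $A_j^{(i)} = \lambda(i) A_j^{(i-1)} / [\lambda(i+1) - \lambda(j+1)]$ for $j < i$, and $A_i^{(i)} = -\sum_{j<i} A_j^{(i)}$; it also assumes that all rates $\lambda(i)$ are distinct. The function name `dfi_state_probs` is hypothetical.

```python
import math


def dfi_state_probs(rates, t):
    """Return [P_0(t), ..., P_N0(t)] for a DFI model.

    rates[i - 1] holds lambda(i), i = 1..N0; all rates must be distinct.
    """
    n0 = len(rates)
    e = [math.exp(-r * t) for r in rates]  # e_j = exp(-lambda(j + 1) * t)
    coeffs = [1.0]                         # A^(0): P_0(t) = e_0
    probs = [e[0]]
    for i in range(1, n0):
        # A_j^(i) for j < i, then A_i^(i) from the condition P_i(0) = 0.
        new = [rates[i - 1] / (rates[i] - rates[j]) * coeffs[j] for j in range(i)]
        new.append(-sum(new))
        coeffs = new
        probs.append(sum(a * ej for a, ej in zip(coeffs, e)))
    # Absorbing state N0: all faults detected and removed.
    probs.append(1.0 - sum(a * rates[n0 - 1] / rates[j] * e[j]
                           for j, a in enumerate(coeffs)))
    return probs


# JM-model rates lambda(i) = phi * (N0 - i + 1) as a special case.
phi, n0 = 0.1, 5
jm_rates = [phi * (n0 - i) for i in range(n0)]
print(dfi_state_probs(jm_rates, 3.0))
```

For the JM-model rates used in the usage example, the state probabilities should coincide with a binomial distribution with parameters $N_0$ and $1 - e^{-\phi t}$, which gives a convenient check of the recursion.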



Some specific DFI models 

A direct generalization of the JM-model is to use a power-type function for the
failure intensity function, $\lambda(i)$. The power-type DFI Markov model was studied
by Xie & Bergman (1988), assuming the failure rate

$$
\lambda(i) = \phi [N_0 - (i - 1)]^{\alpha}, \quad i = 1, 2, \ldots, N_0 \tag{4.18}
$$

It is reasonable to assume that $\lambda(i)$ is a convex function of i, so that $\alpha$ is likely to
be greater than one, since in this case the decrease of the failure rate is larger at
the beginning.

Another special case of the DFI model is the exponential-type Markov model,
which assumes that the failure rate is an exponential function of the number of
remaining faults. It is characterized by the failure rate function

$$
\lambda(i) = \phi \big[\exp\{\beta (N_0 - i + 1)\} - 1\big], \quad i = 1, 2, \ldots, N_0 \tag{4.19}
$$

For the exponential-type DFI model, the decrease of the failure intensity at the
beginning is much faster than that at a later phase.

It is interesting to note that some of the proportional models can also be
regarded as DFI models. If all the $C_i < 1$ ($i = 1, 2, \ldots, N_0$) in a proportional model,
the failure rate $\lambda(i)$ is a decreasing function of i, i.e., it decreases with the number of
remaining faults, which satisfies the DFI definition.



4.2.3. Time-dependent transition probability models 

Sometimes the failure rate function depends not only on the number of detected
faults i but also on the time t; the corresponding Markov process is shown in Fig. 4.3.






Fig. 4.3. CTMC for time dependent transition probability models. 



There are several models which extend the JM-model by assuming that the
probability of state change is also time-dependent. The Schick-Wolverton model is
one of the first models of this type (Schick & Wolverton, 1978). The general
assumptions made by the Schick-Wolverton model are the same as those for the
JM-model, except that the times between failures are independent with the density
function given by

$$
f(t_i) = \phi (N_0 - i + 1)\, t_i \exp\Big\{-\frac{\phi (N_0 - i + 1)\, t_i^2}{2}\Big\}, \quad i = 1, 2, \ldots, N_0 \tag{4.20}
$$

in which $N_0$ is the number of initial faults and $\phi$ is another parameter.

Hence, the main difference between the Schick-Wolverton model and the
JM-model is that the times between failures are not exponential. In the
Schick-Wolverton model, the failure rate function up to the detection of the i:th fault
is

$$
\lambda(i, t_i) = \phi (N_0 - i + 1)\, t_i \tag{4.21}
$$

Note that the failure rate function of the Schick-Wolverton model depends both
on i, the number of removed faults, and on $t_i$, the time since the removal of the last
fault.
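To make the relation between the failure rate (4.21) and the density (4.20) concrete, the sketch below evaluates both, together with the corresponding reliability function $\exp\{-\phi (N_0 - i + 1) t^2 / 2\}$. The function names are illustrative. The closed-form mean is not stated in the text; it is the standard mean of a Rayleigh distribution, which is the form that the density (4.20) takes.

```python
import math


def sw_hazard(i, t, n0, phi):
    """Failure rate of Eq. (4.21) at time t since the last fault removal."""
    return phi * (n0 - i + 1) * t


def sw_reliability(i, t, n0, phi):
    """Probability that the i-th failure has not occurred by time t."""
    return math.exp(-0.5 * phi * (n0 - i + 1) * t * t)


def sw_density(i, t, n0, phi):
    """Density of Eq. (4.20): hazard times reliability."""
    return sw_hazard(i, t, n0, phi) * sw_reliability(i, t, n0, phi)


def sw_mean_time(i, n0, phi):
    """Mean time to the i-th failure (Rayleigh mean)."""
    return math.sqrt(math.pi / (2.0 * phi * (n0 - i + 1)))


print(sw_mean_time(1, 30, 0.001), sw_density(1, 5.0, 30, 0.001))
```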

The Schick-Wolverton model with time-dependent failure rate was further
extended by Shanthikumar (1981). The Shanthikumar (1981) model supposes that
there are $N_0$ initial software faults and assumes that after i faults are removed,
the failure rate of the software is given by

$$
\lambda(i, t) = \phi(t) (N_0 - i) \tag{4.22}
$$

where $\phi(t)$ is a proportionality factor. The parameter estimation can also be
carried out using the method of maximum likelihood.

The Markov formulation and solution procedures of the Shanthikumar (1981)
model are briefly introduced here. Denote by $P_i(t) = \Pr\{N(t) = i\}$ the probability distribution
of $N(t)$, the number of faults that are detected and removed during the time
interval [0, t). Under the Markovian assumption, the forward Kolmogorov
differential equations are given as follows:

$$
\frac{dP_0(t)}{dt} = -N_0\, \phi(t) P_0(t)
$$

$$
\frac{dP_i(t)}{dt} = (N_0 - i + 1)\, \phi(t) P_{i-1}(t) - (N_0 - i)\, \phi(t) P_i(t), \quad 1 \le i \le N_0
$$

The initial conditions are

$$
P_i(0) = 0, \ i > 0, \quad \text{and} \quad P_0(0) = 1
$$

The equations can easily be solved and the solution is given by

$$
P_i(t) = \binom{N_0}{i} \Big[\exp\Big\{-\int_0^t \phi(x)\,dx\Big\}\Big]^{N_0 - i} \Big[1 - \exp\Big\{-\int_0^t \phi(x)\,dx\Big\}\Big]^{i}, \quad 0 \le i \le N_0 \tag{4.23}
$$
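Equation (4.23) is a binomial distribution in which each of the $N_0$ faults has been detected by time t with probability $1 - \exp\{-\int_0^t \phi(x)\,dx\}$. A minimal sketch of its evaluation follows; the function name `shanthikumar_probs` and the choice of passing the integrated factor as a single number `cum_phi` are assumptions made for illustration.

```python
import math


def shanthikumar_probs(n0, cum_phi):
    """Return [P_0(t), ..., P_N0(t)] from Eq. (4.23).

    cum_phi is the integral of phi(x) over [0, t], supplied by the caller.
    """
    undetected = math.exp(-cum_phi)  # probability a given fault is still present
    detected = 1.0 - undetected
    return [math.comb(n0, i) * undetected ** (n0 - i) * detected ** i
            for i in range(n0 + 1)]


# Example with phi(x) = b * x, so the integral over [0, t] is b * t**2 / 2.
b, t = 0.02, 5.0
probs = shanthikumar_probs(10, 0.5 * b * t ** 2)
print(sum(i * p for i, p in enumerate(probs)))  # expected number of faults removed
```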
4.2.4. Imperfect debugging models

The imperfect removal of a detected fault is a common situation in practice and 
the JM-model does not take this into account. This section extends the JM-model 
by relaxing the assumption of perfect debugging process. During an imperfect 
debugging process, there are two kinds of imperfect removal: 

1) the fault is not removed successfully while no new fault is introduced. 

2) the fault is not removed successfully while new faults are generated due 
to the incorrect diagnoses. 

For the former type of imperfect removal, the process is still a monotonous death 
process in terms of the number of remaining faults; while the latter one is a 
birth-death process in terms of the number of remaining faults. Both types of 
imperfect debugging models will be discussed in the following. 



Monotonous death process 

Goel (1985) suggested a Markov model by assuming that each detected fault is 
removed with probability p. Hence, with probability q = 1 -p, a detected fault is not 
perfectly removed and the quantity q can be interpreted as the imperfect 
debugging probability. This process can be modeled by a DTMC as depicted by 
Fig. 4.4 where i is the number of detected failures. 




Fig. 4.4. DTMC for the monotonous death process of the imperfect debugging model.

The counting process of the cumulative number of detected faults at time t is
modeled as a Markov process with transition probability depending on the
probability of imperfect debugging. It is still assumed that the times between the
transitions are exponential with a parameter which depends only on the number
of remaining faults. After the occurrence of i - 1 failures, p(i - 1) faults are
removed on average. Hence, approximately, there are $N_0 - p(i - 1)$ faults
left, where $N_0$ denotes the number of initial faults as before. The failure rate
between the (i-1):st and the i:th failures is then

$$
\lambda(i) = \phi [N_0 - p(i - 1)] \tag{4.24}
$$

Using this transition function, other reliability measures can be calculated as
for the JM-model. Note that the above rate function can be rewritten as

$$
\lambda(i) = \phi\, p \Big[\frac{N_0}{p} - (i - 1)\Big] \tag{4.25}
$$

and from this it can be seen that it is just the same as that for the JM-model with
$\phi$ replaced by $\phi p$ and $N_0$ replaced by $N_0 / p$.

As a consequence, p, $N_0$ and $\phi$ cannot be distinguished from the failure data alone. However, $\phi p$
and $N_0 / p$ can still be estimated in the same way as the parameters of the
JM-model, and $N_0 / p$ can be interpreted as the expected number of failures that
will eventually occur. Another advantage of this model arises when we know
the probability of perfect debugging, p, for example from previous
experience or by checking after correction: the number of initial faults $N_0$ and
the constant of proportionality $\phi$ can then be estimated.



Example 4.3. Suppose that a software product is being tested by a group. The 30 
test data of time between failures are recorded in Table 4.3. 

If the software failures follow the above imperfect debugging model with
p = 0.9, then, viewing it first as the JM-model, we get the following estimates:

$$
N_0 / p = 42.56 \quad \text{and} \quad \phi p = 0.00104
$$

Substituting p = 0.9, we get

$$
N_0 = 38 \quad \text{and} \quad \phi = 0.00116
$$



Table 4.3. 30 test data for time between failures.

Failure  Time between    Failure  Time between    Failure  Time between
number   failures        number   failures        number   failures
   1         8.12          11         0.01          21        85.56
   2         9.76          12        20.48          22        17.95
   3        10.52          13         5.28          23        57.01
   4        59.98          14        65.28          24        80.08
   5         8.67          15        11.83          25        48.40
   6        29.96          16        10.60          26        25.01
   7        24.26          17        62.42          27        97.98
   8        30.74          18        18.97          28        58.61
   9        51.00          19       162.48          29        55.75
  10        18.23          20         4.88          30         4.58
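The last step of Example 4.3, recovering $N_0$ and $\phi$ from the JM-type estimates once p is known, can be written out as a short sketch. The function names `split_imperfect_debugging` and `failure_rate` are hypothetical; the input values are the estimates quoted in the example.

```python
def split_imperfect_debugging(n0_over_p, phi_times_p, p):
    """Recover (N0, phi) from JM-type estimates of N0/p and phi*p, given p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return n0_over_p * p, phi_times_p / p


def failure_rate(i, n0, phi, p):
    """Eq. (4.24): failure rate between the (i-1):st and the i:th failures."""
    return phi * (n0 - p * (i - 1))


n0, phi = split_imperfect_debugging(42.56, 0.00104, 0.9)
print(round(n0), round(phi, 5))
```

By construction, `failure_rate(i, n0, phi, p)` equals the JM-type rate computed from the transformed parameters, which is the content of Eq. (4.25).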



Birth-death process 

Furthermore, if we allow the imperfect debugging process to introduce new faults
into the software due to wrong diagnoses or incorrect modifications, the
debugging process becomes a birth-death Markov process. Kremer (1983)
assumes that when a failure occurs, the fault content is reduced by
1 with probability p, is not changed with probability q, and is increased by a
new fault with probability r. Obviously,

$$
p + q + r = 1
$$

This implies that we have a birth-death process with a birth rate $\nu(t) = r \cdot \lambda(t)$
and a death rate $\mu(t) = p \cdot \lambda(t)$. It can be depicted by the CTMC in Fig. 4.5.

Fig. 4.5. CTMC for the birth-death process of imperfect debugging model.

However, in order to fit failure data and obtain further applicable results,
assumptions on the failure rate function $\lambda(t)$ must be made.

Denote by $N(t)$ the number of remaining faults in the software at time t
and let

$$
P_i(t) = \Pr\{N(t) = i\}, \quad i = 0, 1, \ldots, N_0
$$

We obtain the forward Kolmogorov equations of this Markov process as

$$
P_i'(t) = (i - 1)\, \nu(t) P_{i-1}(t) - i\, [\nu(t) + \mu(t)] P_i(t) + (i + 1)\, \mu(t) P_{i+1}(t), \quad i \ge 0 \tag{4.26}
$$

Generally, by inserting $\nu(t)$ and $\mu(t)$ and using the initial condition
$P_{N_0}(0) = 1$, the differential equations can be solved by using the probability
generating function suggested in Kremer (1983).



Imperfect debugging model considering multi-type failure 

In practice, software failures can be classified into different types according to
their severity or characteristics. Different types of failures may cause different
software reliability performance. Tokuno & Yamada (2001) presented a Markov
model with two types of failures that have different kinds of failure rates and an
imperfect debugging process. The first type is the failures caused by faults
originally latent in the system prior to the testing, denoted by F1. The second type
is the failures due to faults randomly introduced or regenerated during the testing
phase, denoted by F2.

They assumed that

1) The failure rate for F1 is constant between failures and decreases
geometrically as each fault is corrected, and the failure rate for F2 is
constant throughout the testing phase.

2) The debugging activity for the fault is imperfect: denote by p the
probability for a fault to be removed successfully.

3) The debugging activity is performed without distinguishing between F1
and F2.

4) The probability that two or more software failures occur simultaneously is
negligible.

5) At most one fault is corrected when the debugging activity is performed,
and the fault-correction time is negligible or not considered.

Let $X(t)$ be a counting process representing the cumulative number of faults
corrected up to testing time t. From assumption 2, when i faults have been
corrected by an arbitrary testing time t, after the next software failure occurs,

$$
X(t) = \begin{cases} i, & \text{with probability } q \\ i + 1, & \text{with probability } p \end{cases} \tag{4.27}
$$

where q = 1 - p. From assumptions 1 and 3, when i faults have been corrected, the failure rate
for the next software failure occurrence is given by

$$
\lambda(i) = D k^i + \theta, \quad i = 0, 1, 2, \ldots, \quad D > 0, \ 0 < k < 1, \ \theta \ge 0 \tag{4.28}
$$

where D is the initial failure rate for F1, k is the decreasing ratio of the failure
rate, and $\theta$ is the failure rate for F2.

The reliability function to the next software failure is given by

$$
R_i(t) = \exp\{-(D k^i + \theta)\, t\} \tag{4.29}
$$

Furthermore, let $Q_{ij}(\tau)$ denote the one-step transition probability that, after
making a transition into state i, the process $\{X(t), t \ge 0\}$ makes a transition into
state j by time $\tau$. Then, we have

$$
Q_{ij}(\tau) = P_{ij} \big[1 - \exp\{-(D k^i + \theta)\, \tau\}\big] \tag{4.30}
$$

where $P_{ij}$ are the transition probabilities from state i to state j.
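The quantities of this two-failure-type model can be evaluated with a short sketch. It assumes, following Eq. (4.27), that the only transitions out of state i are to state i itself (probability q = 1 - p) and to state i + 1 (probability p); the function names and the numerical parameter values are purely illustrative.

```python
import math


def failure_rate(i, d, k, theta):
    """Eq. (4.28): failure rate once i faults have been corrected."""
    return d * k ** i + theta


def reliability(i, t, d, k, theta):
    """Eq. (4.29): probability of no failure within time t in state i."""
    return math.exp(-failure_rate(i, d, k, theta) * t)


def one_step_transition(i, j, tau, d, k, theta, p):
    """Eq. (4.30), with P_ij taken from Eq. (4.27)."""
    if j == i + 1:
        p_ij = p          # fault corrected successfully
    elif j == i:
        p_ij = 1.0 - p    # imperfect debugging, fault count unchanged
    else:
        p_ij = 0.0
    return p_ij * (1.0 - reliability(i, tau, d, k, theta))


# Illustrative parameter values (not taken from the text).
d, k, theta, p = 0.2, 0.9, 0.01, 0.95
print([round(failure_rate(i, d, k, theta), 4) for i in range(5)])
```

As more faults are corrected, the failure rate decreases geometrically toward the floor $\theta$ set by the F2 failures.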

4.3. Modular Software Systems 

If possible, the architecture of software should be taken into account instead of 
considering the software as a black-box system. Markov models can also be 
applied in analyzing the reliability for modular software system. 

4.3.1. The Littlewood semi-Markov model 

Littlewood (1979) incorporated the structure of the software into the Markov 
process using a kind of semi-Markov model. The program is assumed to be 
comprised of a finite number of modules and the transfer of control between 
modules is described by the probability 

$$
p_{ij} = \Pr\{\text{program transits from module } i \text{ to module } j\}
$$

The time spent in each module has a general distribution $F_{ij}(t)$, which depends
upon i and j, with finite mean $m_{ij}$. When module i is executed, failures occur
according to a Poisson process with parameter $\lambda_i$. Each transfer of control
from module i to module j also has a probability $\nu_{ij}$ of causing a failure.

The interest of the composite model is focused on the total number of failures
of the integrated software system in the time interval (0, t], denoted by $N(t)$. The
asymptotic Poisson process approximation for $N(t)$ is obtained under the
assumption that failures are very infrequent, i.e., the times between failures tend to
be much larger than the times between exchanges of control. The failure
occurrence rate of this Poisson process is given by

$$
\lambda = \sum_{i} a_i \lambda_i + \sum_{i,j} b_{ij} \nu_{ij} \tag{4.31}
$$

where

$$
a_i = \frac{\pi_i \sum_j p_{ij} m_{ij}}{\sum_i \pi_i \sum_j p_{ij} m_{ij}} \tag{4.32}
$$

represents the proportion of time spent in module i, and

$$
b_{ij} = \frac{\pi_i p_{ij}}{\sum_i \pi_i \sum_j p_{ij} m_{ij}} \tag{4.33}
$$

is the frequency of transfer of control between i and j. Here $\pi_i$ is the stationary
probability of module i in the embedded Markov chain with transition probabilities $p_{ij}$.
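A small numerical sketch may help to see how Eqs. (4.31) to (4.33) fit together. It assumes that $\pi_i$ denotes the stationary distribution of the embedded Markov chain with transition matrix $\{p_{ij}\}$, which is computed here by a damped ("lazy") power iteration so that periodic chains are also handled. All function names and the two-module input data are hypothetical.

```python
def stationary(p, iters=10_000, tol=1e-14):
    """Stationary distribution of an irreducible DTMC via lazy power iteration."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
        new = [0.5 * (a + b) for a, b in zip(pi, step)]  # average of pi and pi*P
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def littlewood_rate(p, m, lam, nu):
    """Overall failure rate of Eq. (4.31) with a_i and b_ij from (4.32), (4.33)."""
    n = len(p)
    pi = stationary(p)
    denom = sum(pi[i] * p[i][j] * m[i][j] for i in range(n) for j in range(n))
    a = [pi[i] * sum(p[i][j] * m[i][j] for j in range(n)) / denom for i in range(n)]
    b = [[pi[i] * p[i][j] / denom for j in range(n)] for i in range(n)]
    return (sum(a[i] * lam[i] for i in range(n))
            + sum(b[i][j] * nu[i][j] for i in range(n) for j in range(n)))


# Two modules that alternate; mean sojourn times 2 and 3 time units.
P = [[0.0, 1.0], [1.0, 0.0]]
M = [[0.0, 2.0], [3.0, 0.0]]
print(littlewood_rate(P, M, [0.01, 0.02], [[0.0, 0.001], [0.002, 0.0]]))
```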



4.3.2. Some other modular software models 
User-oriented model 

Similar to the Littlewood semi-Markov model, a model called the user-oriented 
model was developed by Cheung (1980) where the user profile can be 
incorporated into the modeling. The model is a Markov model based on the 
reliability of each individual module and the inter-modular transition probabilities 
as the user profile. 

Assume that the program flow graph of a terminating application has a single
entry and a single exit node, and that the transfer of control among modules can
be described by an absorbing DTMC with a transition probability matrix
$P = \{p_{ij}\}$. Modules fail independently, and the reliability of module i is the
probability $R_i$ that the module performs its function correctly.

Two absorbing states C and F are added, representing the correct output and
failure state, respectively, and the transition probability matrix P is modified
appropriately to $\hat{P}$. The original transition probability between the modules
i and j is modified to $R_i p_{ij}$. This represents the probability that module i
produces the correct result and the control is transferred to module j. From the
exit state n, a directed edge to state C is created with transition probability $R_n$ to
represent the correct execution. The failure of a module i is considered by
creating a directed edge to failure state F with transition probability $1 - R_i$.
Hence, the DTMC defined by the transition probability matrix $\hat{P}$ is a composite
model of the software system. The reliability of the program is the probability of
reaching the absorbing state C of the DTMC.

Let Q be the matrix obtained from $\hat{P}$ by deleting the rows and columns
corresponding to the absorbing states C and F. $Q^k(1, n)$ represents the
probability of reaching state n from state 1 through k transitions. From initial state 1 to
final state n, the number of transitions k may vary from 0 to infinity. It can be
shown that

$$
S = I + Q + Q^2 + Q^3 + \cdots = \sum_{k=0}^{\infty} Q^k = (I - Q)^{-1} \tag{4.34}
$$

and it follows that the overall system reliability can be computed as

$$
R = S(1, n)\, R_n \tag{4.35}
$$
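The computation in Eqs. (4.34) and (4.35) can be sketched without any matrix library. The sketch below assumes that modules are indexed 0 to n - 1 with module 0 as the entry and module n - 1 as the exit, and it obtains $S(1, n)$ by solving the linear system $(I - Q)x = e_n$ with Gauss-Jordan elimination rather than by forming the inverse. The function name `cheung_reliability` and the example data are hypothetical.

```python
def cheung_reliability(p, r):
    """System reliability R = S(1, n) * R_n of Eq. (4.35).

    p[i][j]: transfer probability from module i to module j.
    r[i]: reliability of module i. Entry is module 0, exit is module n - 1.
    """
    n = len(r)
    # Augmented matrix [I - Q | e_n] with Q[i][j] = r[i] * p[i][j].
    a = [[(1.0 if i == j else 0.0) - r[i] * p[i][j] for j in range(n)]
         + [1.0 if i == n - 1 else 0.0] for i in range(n)]
    for col in range(n):
        # Partial pivoting, then eliminate the column from all other rows.
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [x - f * y for x, y in zip(a[row], a[col])]
    s_1n = a[0][n] / a[0][0]  # first component of the n-th column of S
    return s_1n * r[n - 1]


# Module 0 calls module 1 with probability 0.6 or goes directly to exit module 2.
P = [[0.0, 0.6, 0.4], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
print(cheung_reliability(P, [0.99, 0.95, 0.90]))
```

For this branching example the result can be checked by hand as $R_1 (0.6 R_2 + 0.4) R_3$, since each path from entry to exit is traversed at most once.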



Task-oriented model 

Modular software is usually developed to complete certain tasks. Kubat (1989)
presented a task-oriented model which considers the case of a terminating
software application composed of n modules designed for K different tasks. Each
task may require several modules, and the same module can be used for different
tasks. Transitions between modules follow a DTMC such that, with probability
$q_i(k)$, task k will first call module i and, with probability $p_{ij}(k)$, task k will call
module j after executing in module i. The sojourn time during the visit in module
i by task k has the density function $g_i(k, t)$. Hence, a semi-Markov process can
be used.

The failure rate of module i is $\lambda_i$. As shown in Kubat (1989), the probability
that no failure occurs during the execution of task k while in module i is

$$
R_i(k) = \int_0^{\infty} e^{-\lambda_i t} g_i(k, t)\, dt \tag{4.36}
$$

The expected number of visits in module i by task k, denoted by $V_i(k)$, can be
obtained by solving

$$
V_i(k) = q_i(k) + \sum_{j=1}^{n} V_j(k)\, p_{ji}(k), \quad i = 1, 2, \ldots, n, \quad k = 1, 2, \ldots, K \tag{4.37}
$$

The probability that there will be no failure when running for task k can be
approximated by

$$
R(k) \approx \prod_{i=1}^{n} [R_i(k)]^{V_i(k)} \tag{4.38}
$$

and the system failure rate is calculated by

$$
\lambda_s = \sum_{k=1}^{K} r_k [1 - R(k)] \tag{4.39}
$$

where $r_k$ is the arrival rate of task k.
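The chain of calculations in Eqs. (4.36) to (4.39) can be sketched for a single task as follows. Two assumptions are made purely for illustration and are not part of the model itself: the sojourn time in module i is taken to be exponential with rate `mu[i]`, so that the integral in Eq. (4.36) evaluates to $\mu_i / (\mu_i + \lambda_i)$, and the expected visits are obtained by fixed-point iteration of the visit equations, with each visit to module i fed by transitions into i. The function names are hypothetical.

```python
def expected_visits(q, p, iters=100_000, tol=1e-13):
    """Solve V_i = q_i + sum_j V_j * p[j][i] by fixed-point iteration.

    Converges for terminating applications (substochastic p).
    """
    n = len(q)
    v = list(q)
    for _ in range(iters):
        new = [q[i] + sum(v[j] * p[j][i] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


def task_reliability(q, p, lam, mu):
    """Eq. (4.38), with R_i from Eq. (4.36) under exponential sojourn times."""
    visits = expected_visits(q, p)
    result = 1.0
    for v_i, lam_i, mu_i in zip(visits, lam, mu):
        result *= (mu_i / (mu_i + lam_i)) ** v_i
    return result


def system_failure_rate(arrival_rates, task_reliabilities):
    """Eq. (4.39): sum over tasks of r_k * (1 - R(k))."""
    return sum(r * (1.0 - rel) for r, rel in zip(arrival_rates, task_reliabilities))


# One task: starts in module 0, then always calls module 1 and terminates.
rel = task_reliability([1.0, 0.0], [[0.0, 1.0], [0.0, 0.0]], [0.1, 0.2], [10.0, 10.0])
print(rel, system_failure_rate([2.0], [rel]))
```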

Multi-type failure model in modular software 

Ledoux (1999) further proposed a Markov model to include multi-type
failures in the modular software reliability analysis. An
irreducible CTMC with transition rates $q_{ij}$ is constructed to model the software composed of a
set of components C. In the model, two types of failures are considered: primary
failures and secondary failures. A primary failure leads to an execution break;
the execution is restarted after some delay. A secondary failure does not affect
the software because the execution is assumed to be restarted instantaneously
when the failure appears. For an active component $c_i$, a primary failure occurs
with constant rate $\lambda_{i1}$, while the secondary failures are described as a Poisson
process with rate $\lambda_{i2}$. When control is transferred between two components i and
j, a primary (secondary) interface failure occurs with probability $\nu_{ij1}$ ($\nu_{ij2}$).

Following the occurrence of a primary failure, a recovery state is occupied,
and the delay of the execution break is a random variable with a phase-type
distribution. Denoting by R the set of recovery states, the state space becomes
$C \cup R$. Hence, the CTMC that defines the architecture is replaced by a CTMC
that models the alternation of operational and recovery periods. The associated generator
matrix defines the following transition rates: from $c_i$ to $c_j$ with no failure;
from $c_i$ to $c_j$ with a secondary failure; from $c_i$ to $c_j$ with a primary failure;
from recovery state i to recovery state j; and from recovery state i to $c_j$.

A Markov model is then constructed according to the architecture of 
different modules and their states. Based on the CTMC, the 
Chapman-Kolmogorov equations can be obtained and solved by computational 
tools. 



4.4. Models for Correlated Failures

Perhaps the most stringent restriction in most software reliability models is the
assumption of statistical independence among successive software failures. In practice, it is
common for software failures to be correlated in successive runs. In order to deal
with this issue, Goseva-Popstojanova & Trivedi (2000) formulated a Markov
renewal model that can capture the phenomenon of failure correlation.

4.4.1. Description of the correlated failures

Since each software run has two possible outcomes (success or failure), the usual
way of looking at the sequence of software runs is to consider it as a sequence of
Bernoulli trials, where each trial has success probability p and failure probability
1 - p. Goseva-Popstojanova & Trivedi (2000) constructed a Markov renewal
model for the sequence of dependent software runs in two stages:

1) Define a DTMC which considers the outcomes from the sequence of 
possibly dependent software runs in discrete time. 

2) Construct the process in continuous time by attaching the distributions of 
the runs execution to the transitions of the DTMC. 

The assumptions of the model are: 

1) The probability of success or failure at each run depends on the outcome 
of the previous run. 

2) A sequence of software runs is defined as a sequence of dependent 
Bernoulli trials. 

3) Each software run takes a random amount of time to be executed. 

4) Software execution times are not identically distributed for successful and 
failed runs. 



4.4.2. Constructing the semi-Markov model 

Associated with the j:th software run, let $Z_j$ be a random variable that
distinguishes whether the outcome of that particular run resulted in success or
failure:

$$
Z_j = \begin{cases} 0, & \text{a success on run } j \\ 1, & \text{a failure on run } j \end{cases} \tag{4.40}
$$

Here we use score 1 each time a failure occurs and 0 otherwise. The number
of runs that have resulted in a failure among n successive software runs is
$S_n = \sum_{j=1}^{n} Z_j$, the sum of n possibly dependent random variables.

Suppose that run j results in failure. Then, at run (j+1), the failure probability is
q and the success probability is $\bar{q} = 1 - q$. Similarly, if run j results in success, then p
and $\bar{p} = 1 - p$ are the probabilities of success and failure, respectively, at run (j+1). The
sequence of dependent Bernoulli trials $\{Z_j;\ j \ge 1\}$ defines a DTMC with 2
states. One is a success state, denoted by 0; the other, denoted by 1, is a failure state. Its
transition probability matrix is

$$
P = \begin{bmatrix} p & \bar{p} \\ \bar{q} & q \end{bmatrix}
$$

as shown by Fig. 4.6.




Fig. 4.6. DTMC interpretation of dependent Bernoulli trials. 



The unconditional probability of failure on run (j+1) can be derived, see e.g.
Goseva-Popstojanova & Trivedi (2000), as

$$
\Pr\{Z_{j+1} = 1\} = \bar{p} + (p + q - 1) \cdot \Pr\{Z_j = 1\} \tag{4.41}
$$

where $\bar{p} = 1 - p$. This equation shows the property of failure correlation in successive runs. If
p + q = 1, the Markov chain describes a sequence of independent Bernoulli trials,
and the above equation reduces to

$$
\Pr\{Z_{j+1} = 1\} = \bar{p} = q
$$

which means that the failure probability does not depend on the outcome of the
previous run. When p + q > 1, runs are positively correlated, i.e., if a software
failure occurs in run j, then there is an increased chance that another failure
occurs in the next run. In this case, failures occur in clusters. Finally, when
p + q < 1, runs are negatively correlated. In this case, if a software failure occurs
in run j, then there is an increased chance that a success occurs in run (j+1), i.e.,
there is a lack of clustering.
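The recursion for the unconditional failure probability is simple to iterate. The sketch below uses hypothetical function names and illustrative values of p (success after success) and q (failure after failure). The limiting value $(1 - p)/(2 - p - q)$ is not stated in the text; it follows from setting $\Pr\{Z_{j+1} = 1\} = \Pr\{Z_j = 1\}$ in the recursion.

```python
def next_failure_prob(prev, p, q):
    """Pr{Z_{j+1} = 1} = (1 - p) + (p + q - 1) * Pr{Z_j = 1}."""
    return (1.0 - p) + (p + q - 1.0) * prev


def failure_prob_sequence(first, p, q, runs):
    """Failure probabilities for runs 1..runs, given Pr{Z_1 = 1} = first."""
    probs = [first]
    for _ in range(runs - 1):
        probs.append(next_failure_prob(probs[-1], p, q))
    return probs


def limiting_failure_prob(p, q):
    """Fixed point of the recursion (requires p + q < 2)."""
    return (1.0 - p) / (2.0 - p - q)


# Positively correlated runs: p + q = 1.5 > 1.
seq = failure_prob_sequence(0.0, 0.9, 0.6, 50)
print(seq[:4], limiting_failure_prob(0.9, 0.6))
```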

The next step in the model construction is to obtain a process in continuous
time. Let $F_{kl}(t)$ be the cumulative distribution function of the time spent in a
transition from state k to state l. It is realistic to assume that the run execution
times are not identically distributed for successful and failed runs. Hence, the
$F_{kl}(t)$ depend only on the type of point at the end of the interval, i.e.,

$$
F_{00}(t) = F_{10}(t) = F_{exS}(t)
$$

and

$$
F_{01}(t) = F_{11}(t) = F_{exF}(t)
$$

With the addition of the $F_{kl}(t)$ to the transitions of the DTMC, we obtain a Markov
renewal process as the software reliability model in continuous time.



4.4.3. Considering software reliability growth 

During the testing phase, software is subjected to a sequence of runs, and no
changes are made if there is no failure. When a failure occurs on any run, an attempt is
made to fix the underlying fault, which causes the conditional probabilities of
success and failure on the next run to change. The software reliability growth
model in discrete time can be described by a sequence of dependent Bernoulli
trials with step-dependent probabilities. The underlying stochastic process is a
nonhomogeneous DTMC.

The sequence $S_n = \sum_{j=1}^{n} Z_j$ provides an alternate description of the
software reliability growth model considered here. That is, $\{S_n\}$ defines the
DTMC presented in Fig. 4.7.

Fig. 4.7. States transition over time.

Both states i and $i_s$ represent that the failure state has been occupied i times.
State i represents the first trial for which $S_n = i$. State $i_s$ represents all
subsequent trials for which $S_n = i$, i.e., all subsequent successful runs before the
occurrence of the next failure (i+1). Without loss of generality, let the first run be
successful, which means that 0 is the initial state.

The software reliability growth model in continuous time can also be 
obtained by assigning runs execution-time distributions to transitions of the 
DTMC in Fig. 4.7. For simplicity, we have chosen the same execution time 
distribution regardless of the outcome: 

Fqo(0 = F ai (t) = F ]0 (t) = F u (t) - F(t) 

Flence, T ex of each software run has cumulative distribution function 



F(t) = Pr {T ex <t} 
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Let X l+ , be the number of runs between failures i and i+1, From Fig. 4.7, 
the random variable $X_{i+1}$ ($i \ge 1$) has the following distribution:

$$\Pr(X_{i+1} = k) = \begin{cases} q_i, & k = 1 \\ (1 - q_i)\, p_i^{k-2}\, (1 - p_i), & k \ge 2 \end{cases} \tag{4.42}$$

It follows that the distribution function of the time to failure $(i+1)$, given
that the system has had $i$ failures, $i \ge 1$, is:

$$F_{i+1}(t) = q_i F(t) + \sum_{k=2}^{\infty} (1 - q_i)\, p_i^{k-2}\, (1 - p_i)\, F^{k*}(t) \tag{4.43}$$

where $F^{k*}(t)$ is the $k$-fold convolution of $F(t)$.

Let $\tilde{F}(s)$ be the Laplace-Stieltjes transform of $F(t)$; the above
equation is then transformed to

$$\tilde{F}_{i+1}(s) = q_i \tilde{F}(s) + \sum_{k=2}^{\infty} (1 - q_i)\, p_i^{k-2}\, (1 - p_i)\, \tilde{F}^{k}(s) = \frac{q_i \tilde{F}(s) + (1 - p_i - q_i)\tilde{F}^{2}(s)}{1 - p_i \tilde{F}(s)} \tag{4.44}$$

Its inversion is straightforward and reasonably simple closed-form results can be
obtained when $F(t)$ has a rational Laplace-Stieltjes transform.
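
For readers who want the intermediate step, the closed form in Eq. (4.44) follows from a geometric series. The sketch below writes $\tilde{F}(s)$ for the Laplace-Stieltjes transform of $F(t)$ and assumes $p_i < 1$, so that $|p_i \tilde{F}(s)| < 1$ and the series converges.

```latex
\begin{aligned}
\sum_{k=2}^{\infty} (1-q_i)\,p_i^{k-2}(1-p_i)\,\tilde{F}^{k}(s)
 &= (1-q_i)(1-p_i)\,\tilde{F}^{2}(s)\sum_{m=0}^{\infty}\bigl[p_i\tilde{F}(s)\bigr]^{m}
  = \frac{(1-q_i)(1-p_i)\,\tilde{F}^{2}(s)}{1-p_i\tilde{F}(s)},\\[4pt]
\tilde{F}_{i+1}(s)
 &= \frac{q_i\tilde{F}(s)\bigl[1-p_i\tilde{F}(s)\bigr]+(1-q_i)(1-p_i)\,\tilde{F}^{2}(s)}{1-p_i\tilde{F}(s)}
  = \frac{q_i\tilde{F}(s)+(1-p_i-q_i)\,\tilde{F}^{2}(s)}{1-p_i\tilde{F}(s)} .
\end{aligned}
```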

Some general properties of the inter-failure time can be developed without
making assumptions about the form of $F(t)$. For example, the MTTF is:

$$\mathrm{MTTF} = E[T_{i+1}] = -\left.\frac{d\tilde{F}_{i+1}(s)}{ds}\right|_{s=0} = \frac{2 - p_i - q_i}{1 - p_i}\, E[T_{ex}] \tag{4.45}$$

where $E[T_{ex}]$ is the mean execution time.

Example 4.4. Suppose that the failures of a piece of software are correlated between
successive runs with $p_0 = 0.7$ (from success to success state) and $q_0 = 0.8$ (from
failure to failure state). The execution time of each run is assumed to follow the
exponential distribution with mean $\mu = 30$ hours.

Substituting these values into Eq. (4.45), we get

$$\mathrm{MTTF} = \frac{2 - p_0 - q_0}{1 - p_0}\,\mu = \frac{2 - 0.7 - 0.8}{0.3} \times 30 = 50 \text{ (hours)}$$

During the testing phase, each detected failure triggers an attempt to remove its cause, so the
conditional probabilities change as in Fig. 4.7. If we assume $p_{i+1} = 1.02\,p_i$
and $q_{i+1} = 0.98\,q_i$, then

$$p_i = (1.02)^i p_0 \quad \text{and} \quad q_i = (0.98)^i q_0, \quad i = 1, 2, \ldots$$

If the customer requires that the MTTF be longer than 100 hours, then to
determine the testing time we use the following inequality:

$$30 \cdot \frac{2 - (1.02)^i\, 0.7 - (0.98)^i\, 0.8}{1 - (1.02)^i\, 0.7} > 100$$

Solving this, we get $i > 9.8$, so the least number of detected/debugged failures
should be 10. Then, the expected testing time before release can be computed as

$$T \ge 30 \cdot \sum_{i=0}^{9} \frac{2 - (1.02)^i\, 0.7 - (0.98)^i\, 0.8}{1 - (1.02)^i\, 0.7} = 668.55 \text{ (hours)}$$

That is, in order to satisfy the customer requirement, the software should be
tested for at least 669 hours before release.
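
The arithmetic of Example 4.4 can be checked with a few lines of Python. This is only a sketch: the function names `mttf` and `failures_needed` are illustrative and not from the text, and the update rule $p_{i+1} = 1.02\,p_i$, $q_{i+1} = 0.98\,q_i$ is the assumption made in the example.

```python
def mttf(i, p0=0.7, q0=0.8, mean_exec=30.0):
    """MTTF after i debugged failures, Eq. (4.45) with p_i = 1.02^i p0 and q_i = 0.98^i q0."""
    p_i = (1.02 ** i) * p0
    q_i = (0.98 ** i) * q0
    return (2.0 - p_i - q_i) / (1.0 - p_i) * mean_exec


def failures_needed(target=100.0):
    """Smallest i for which the MTTF exceeds the target (hours)."""
    i = 0
    while mttf(i) <= target:
        i += 1
    return i


n_fail = failures_needed()                              # number of failures to debug
testing_time = sum(mttf(i) for i in range(n_fail))      # sum of MTTFs for i = 0..n_fail-1
print(n_fail, round(testing_time, 2))
```

The loop stops at the first index whose MTTF exceeds 100 hours, and the testing time is the sum of the expected inter-failure times accumulated up to that point.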
4.5. Software NHPP Models

Although some basic and advanced Markov models are presented in the previous
sections, some NHPP models are mentioned here because of their significant impact
on software reliability analysis. Such a model simply models the failure
occurrence rate as a function of time (see, e.g., Section 2.4). Ideally, this
occurrence rate decreases as faults are removed through debugging.
Note that after the release, the failure occurrence rate should be constant unless
debugging is continued (Yang & Xie, 2000).



4.5.1. The Goel-Okumoto (GO) model 

In 1979, Goel and Okumoto presented a simple model for the description of 
software failure process by assuming that the cumulative failure process is NHPP 
with a simple mean value function. Although NHPP models have been studied 
before, see e.g. Schneidewind (1975), the GO-model is the basic NHPP model 
that later has had a strong influence on the software reliability modeling history. 



Model description 

The general assumptions of the GO-model are 

1) The cumulative number of faults detected at time t follows a Poisson 
distribution. 

2) All faults are independent and have the same chance of being detected. 

3) All detected faults are removed immediately and no new faults are 
introduced. 

Specifically, the GO-model assumes that the failure process is modeled by an
NHPP model with mean value function $m(t)$ given by

$$m(t) = a[1 - \exp(-bt)], \quad a > 0,\ b > 0 \tag{4.46}$$

The failure intensity function can be derived as

$$\lambda(t) = \frac{d}{dt} m(t) = ab\exp(-bt) \tag{4.47}$$

where $a$ and $b$ are positive constants. Note that $m(\infty) = a$. The physical meaning
of parameter $a$ can be explained as the expected number of faults which are
eventually detected. The quantity $b$ can be interpreted as the failure occurrence
rate per fault.

The expected number of remaining faults at time $t$ can be calculated as

$$E[N(\infty) - N(t)] = m(\infty) - m(t) = a\exp(-bt)$$

The GO-model has a simple but interesting interpretation based on a model
for the fault detection process. Suppose that the expected number of faults detected in
a time interval $[t, t + \Delta t)$ is proportional to the number of remaining faults. Then we
have

$$m(t + \Delta t) - m(t) = b[a - m(t)]\Delta t$$

where $b$ is a constant of proportionality.

The above difference equation can be transformed into a differential
equation. Dividing both sides by $\Delta t$ and letting $\Delta t$ tend to zero,
we get the following equation:

$$m'(t) = ab - b\, m(t)$$

It can be shown that the solution of this differential equation, together with the
initial condition $m(0) = 0$, leads to the mean value function of the GO-model.
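
One way to see this, sketched here with the standard integrating-factor method for the linear equation $m'(t) = b[a - m(t)]$, is the following.

```latex
\begin{aligned}
m'(t) + b\,m(t) = ab
  \quad&\Longrightarrow\quad \frac{d}{dt}\bigl[e^{bt}m(t)\bigr] = ab\,e^{bt},\\[4pt]
e^{bt}m(t) - m(0) = a\bigl(e^{bt} - 1\bigr)
  \quad&\Longrightarrow\quad m(t) = a\bigl[1 - e^{-bt}\bigr] \quad \text{since } m(0) = 0 .
\end{aligned}
```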

Note that both the GO-model and the JM-model give an exponentially
decreasing number of remaining faults. It can be shown that these two models
cannot be distinguished using only one realization from each model. However,
the models are different because the JM-model assumes a discrete change of the
failure intensity at the time of the removal of a fault while the GO-model assumes
a continuous failure intensity function over the whole time domain.



Parameter estimation

Denote by $n_i$ the number of faults detected in the time interval $[t_{i-1}, t_i)$, where
$0 = t_0 < t_1 < \cdots < t_k$ and the $t_i$ are the running times since the software testing
began. The estimation of model parameters $a$ and $b$ can be carried out by
maximizing the likelihood function, see e.g. Goel & Okumoto (1979). The
likelihood equation for $b$ can be reduced to

$$\sum_{i=1}^{k} \frac{n_i\,[t_i \exp(-b t_i) - t_{i-1}\exp(-b t_{i-1})]}{\exp(-b t_{i-1}) - \exp(-b t_i)} = \frac{t_k \exp(-b t_k) \sum_{i=1}^{k} n_i}{1 - \exp(-b t_k)} \tag{4.48}$$

Solving this equation gives the estimate of $b$, and then $a$ can be estimated
as

$$\hat{a} = \frac{\sum_{i=1}^{k} n_i}{1 - \exp(-\hat{b} t_k)} \tag{4.49}$$

Usually, the above two equations have to be solved numerically. It can also be
shown that the estimates are asymptotically normal and a confidence region can
easily be established. A numerical example is given below.
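
A minimal numerical sketch of Eqs. (4.48) and (4.49) is given here, using plain bisection on $b$. The function names, the bracket `[lo, hi]` and the synthetic interval counts are assumptions made for illustration; the counts are generated noise-free from a GO model with $a = 100$ and $b = 0.005$, so the solver should recover those values.

```python
import math


def go_score(b, t, n):
    """Left side minus right side of Eq. (4.48). t = [t_0=0, t_1, ..., t_k], n = [n_1, ..., n_k]."""
    lhs = 0.0
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        # Same ratio as in (4.48) after cancelling exp(-b t_{i-1}) in numerator and denominator.
        lhs += n[i - 1] * (t[i] * math.exp(-b * dt) - t[i - 1]) / (-math.expm1(-b * dt))
    rhs = t[-1] * sum(n) / math.expm1(b * t[-1])
    return lhs - rhs


def fit_go_grouped(t, n, lo=1e-8, hi=None, tol=1e-12):
    """Bisection for b in (4.48), then a from (4.49). The default bracket is an assumption."""
    if hi is None:
        hi = 50.0 / t[-1]
    f_lo = go_score(lo, t, n)
    if f_lo * go_score(hi, t, n) > 0:
        raise ValueError("no sign change in bracket; the data may not follow the GO shape")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = go_score(mid, t, n)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    b_hat = 0.5 * (lo + hi)
    a_hat = sum(n) / (-math.expm1(-b_hat * t[-1]))
    return a_hat, b_hat


# Hypothetical, noise-free grouped data generated from a = 100, b = 0.005.
t_obs = [0.0, 100.0, 200.0, 300.0, 400.0, 500.0]
n_obs = [100.0 * (math.exp(-0.005 * t_obs[i - 1]) - math.exp(-0.005 * t_obs[i]))
         for i in range(1, len(t_obs))]
a_hat, b_hat = fit_go_grouped(t_obs, n_obs)
print(a_hat, b_hat)
```

Bisection is used only because it is simple and robust; any one-dimensional root finder would serve equally well for Eq. (4.48).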



Example 4.5. Suppose a software product is being tested by a group. Each time
a failure is detected, the fault is removed, and the repair time is not counted
in the test time. The times to failure for 30 failures are recorded in Table 4.4.

Solving the likelihood equations, we get $\hat{b} = 0.0008$ and $\hat{a} = 57$. The
failure intensity function and the mean value function for this GO model are

$$\lambda(t) = 0.0456\exp(-0.0008t)$$

and

$$m(t) = 57[1 - \exp(-0.0008t)]$$

Table 4.4. A set of time to failure data.



Failure   Time to     Failure   Time to     Failure   Time to
number    failure     number    failure     number    failure
   1        1.33        11      288.43        21      547.21
   2        3.43        12      288.84        22      554.65
   3       24.87        13      303.02        23      629.93
   4       58.15        14      330.30        24      741.44
   5       85.78        15      375.16        25      773.25
   6      145.84        16      414.85        26      789.56
   7      203.84        17      417.96        27      815.74
   8      205.82        18      434.56        28      874.62
   9      219.97        19      517.28        29      888.81
  10      244.09        20      543.72        30      924.94
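
As a quick plausibility check of the fitted model in Example 4.5, the sketch below evaluates the GO mean value function, intensity and expected remaining faults at the time of the last recorded failure. The parameter values are those quoted in the example; the function names are illustrative only.

```python
import math

A_HAT, B_HAT = 57.0, 0.0008   # estimates quoted in Example 4.5


def go_mean(t, a=A_HAT, b=B_HAT):
    """Mean value function m(t) of Eq. (4.46)."""
    return a * (1.0 - math.exp(-b * t))


def go_intensity(t, a=A_HAT, b=B_HAT):
    """Failure intensity of Eq. (4.47)."""
    return a * b * math.exp(-b * t)


def go_remaining(t, a=A_HAT, b=B_HAT):
    """Expected number of remaining faults, m(inf) - m(t)."""
    return a * math.exp(-b * t)


t_end = 924.94                      # time of the 30th failure in Table 4.4
expected_detected = go_mean(t_end)
expected_remaining = go_remaining(t_end)
current_intensity = go_intensity(t_end)
print(expected_detected, expected_remaining, current_intensity)
```

With these parameters the expected number of detected faults at the end of the data is close to the 30 failures actually observed, which is what one would hope for from a fitted mean value function.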



4.5.2. S-shaped NHPP models 

The mean value function of the GO-model is exponential-shaped. Based on the 
experience, it is observed that the curve of the cumulative number of faults is 
often S-shaped as shown by Fig. 4.8, see e.g. Yamada et al. (1984). 




Fig. 4.8. The S-shaped mean value function.

This can be explained by the fact that at the beginning of testing, some
faults might be “covered” by other faults. Removing a detected fault at the
beginning does not reduce the failure intensity very much since the same test data
will still lead to a failure caused by other faults. Another reason for the S-shaped
behavior is the learning effect, as indicated in Yamada et al. (1984).

Delayed S-shaped NHPP model

The mean value function of the delayed S-shaped NHPP model is

$$m(t) = a[1 - (1 + bt)\exp(-bt)], \quad b > 0 \tag{4.50}$$

This is a two-parameter S-shaped curve with parameter $a$ denoting the number of
faults to be detected and $b$ corresponding to a fault detection rate. The
corresponding failure intensity function of this delayed S-shaped NHPP model is

$$\lambda(t) = \frac{dm(t)}{dt} = ab(1 + bt)\exp(-bt) - ab\exp(-bt) = ab^{2} t\exp(-bt)$$

The expected number of remaining faults at time $t$ is then

$$m(\infty) - m(t) = a(1 + bt)\exp(-bt)$$



Inflected S-shaped NHPP model

The mean value function of the inflected S-shaped NHPP model is

$$m(t) = \frac{a[1 - \exp(-bt)]}{1 + c\exp(-bt)}, \quad b > 0,\ c > 0$$

In the above, $a$ is again the total number of faults to be detected while $b$ and $c$ are
called the fault detection rate and the inflection factor, respectively. The intensity
function of this inflected S-shaped NHPP model can easily be derived as

$$\lambda(t) = \frac{dm(t)}{dt} = \frac{ab(1 + c)\exp(-bt)}{[1 + c\exp(-bt)]^{2}}$$

Given a set of failure data, for both the delayed and the inflected S-shaped NHPP models,
numerical methods have to be used to solve the likelihood equations so that
estimates of the parameters can be obtained.
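
The two S-shaped mean value functions and their intensities can be written down directly. The sketch below also checks numerically that each intensity is the derivative of its mean value function; the parameter values $a = 100$, $b = 0.05$, $c = 4$ and the helper `central_diff` are hypothetical choices for illustration.

```python
import math


def delayed_s_mean(t, a, b):
    return a * (1.0 - (1.0 + b * t) * math.exp(-b * t))


def delayed_s_intensity(t, a, b):
    return a * b * b * t * math.exp(-b * t)


def inflected_s_mean(t, a, b, c):
    e = math.exp(-b * t)
    return a * (1.0 - e) / (1.0 + c * e)


def inflected_s_intensity(t, a, b, c):
    e = math.exp(-b * t)
    return a * b * (1.0 + c) * e / (1.0 + c * e) ** 2


def central_diff(f, t, h=1e-5):
    """Symmetric finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


a, b, c = 100.0, 0.05, 4.0          # hypothetical parameter values
points = (1.0, 10.0, 50.0, 200.0)
max_err = max(
    max(abs(central_diff(lambda u: delayed_s_mean(u, a, b), t) - delayed_s_intensity(t, a, b)),
        abs(central_diff(lambda u: inflected_s_mean(u, a, b, c), t) - inflected_s_intensity(t, a, b, c)))
    for t in points
)
print(max_err)
```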



4.5.3. Some other NHPP models 

Besides the S-shaped models, there are many other NHPP models that extend the 
GO-model for different specific conditions. 



Duane model 

The Duane model assumes that the mean value function satisfies

$$m(t) = \left(\frac{t}{\alpha}\right)^{\beta}, \quad \alpha > 0,\ \beta > 0 \tag{4.51}$$

In the above, $\alpha$ and $\beta$ are parameters which can be estimated by using
collected failure data. The mean value functions with $\alpha = 100$ and
$\beta \in \{0.5, 1, 2\}$ are depicted in Fig. 4.9.

It can be noted that when $\beta = 1$, the Duane NHPP model is reduced to a
Poisson process whose mean value function is a straight line. In such a case, there
is no reliability growth. In fact, the Duane model can be used to model both
reliability growth ($\beta < 1$) and reliability deterioration ($\beta > 1$) which is common
in hardware systems.

The failure intensity function, $\lambda(t)$, is

$$\lambda(t) = \frac{d}{dt} m(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta - 1}, \quad \alpha > 0,\ \beta > 0$$





Fig. 4.9. Mean value functions of Duane NHPP models.



One of the most important advantages of the Duane model is that if we plot the
cumulative number of failures versus the cumulative testing time on a
log-log scale, the plotted points tend to be close to a straight line if the model is
valid. This can be seen from the fact that the relation between $m(t)$ and $t$ can be
rewritten as

$$\ln m(t) = -\beta\ln\alpha + \beta\ln t = a + b\ln t$$

where $a = -\beta\ln\alpha$ and $b = \beta$. Hence, $\ln m(t)$ is a linear function of $\ln t$ and,
due to this linear relation, the parameters $\alpha$ and $\beta$ may be estimated
graphically and the model validity can easily be verified. In fact, this is called the
first-model-validation-then-parameter-estimation approach (Xie & Zhao, 1993).
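
The graphical procedure can be mimicked by an ordinary least-squares line through the points $(\ln t, \ln m)$. The sketch below is one possible implementation, not the authors' procedure; the function name and the noise-free test points (generated from $\alpha = 100$, $\beta = 0.5$) are assumptions for illustration.

```python
import math


def fit_duane_loglog(times, cum_failures):
    """Least-squares line ln m = a + b ln t; returns (alpha, beta) with beta = b, alpha = exp(-a/b)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(m) for m in cum_failures]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    beta = slope
    alpha = math.exp(-intercept / beta)
    return alpha, beta


# Hypothetical points lying exactly on m(t) = (t / 100) ** 0.5.
times = [10.0, 50.0, 100.0, 400.0, 900.0]
counts = [(t / 100.0) ** 0.5 for t in times]
alpha_hat, beta_hat = fit_duane_loglog(times, counts)
print(alpha_hat, beta_hat)
```

With real data, the scatter of the points around the fitted line gives a visual indication of whether the Duane model is adequate before the estimates are used.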



The Duane model gives an infinite failure intensity at time zero when $\beta < 1$. Littlewood
(1984) proposed a modified Duane model with the mean value function

$$m(t) = k\left[1 - \left(\frac{\alpha}{\alpha + t}\right)^{\beta}\right], \quad \alpha > 0,\ \beta > 0,\ k > 0$$

The parameter $k$ can be interpreted as the number of faults eventually to be
detected.



Log-power model 

Xie & Zhao (1993) presented a log-power model. The mean value function of
this model can be written as

$$m(t) = a\ln^{b}(1 + t), \quad a, b > 0,\ t > 0 \tag{4.52}$$

This model has been shown to be useful for software reliability analysis as it is a pure
reliability growth model. It is also easy to use due to its graphical interpretation.
The plot of the cumulative number of failures at time $t$ against $t + 1$ will tend to be
a straight line on a log-double-log scale if the failures follow the log-power
model. This can be seen from the following relationship:

$$\ln m(t) = \ln a + b\ln\ln(1 + t)$$

The slope of the fitted line gives an estimate of $b$ and its intercept on the
vertical axis gives an estimate of $\ln a$.

The failure intensity function of the log-power model can be obtained as

$$\lambda(t) = \frac{ab\ln^{b-1}(1 + t)}{1 + t}, \quad t > 0 \tag{4.53}$$

The failure intensity function is interesting from a practical point of view. The
log-power model is able to analyze both the case of strictly decreasing failure
intensity and the case of increasing-then-decreasing failure intensity. For
example, if $b \le 1$, then $\lambda(t)$ of the above equation is a monotonically decreasing
function of $t$; otherwise, for $b > 1$, $\lambda(t)$ is increasing if $0 \le t < \exp(b - 1) - 1$
and decreasing if $t > \exp(b - 1) - 1$.

The estimation of the parameters $a$ and $b$ is also simple. Suppose a total of $n$
failures are detected during a testing period $(0, T]$ and the times to failure
are ordered as $0 < t_1 < t_2 < \cdots < t_n \le T$. The maximum likelihood estimates of $a$
and $b$ are then given by:

$$\hat{b} = \frac{n}{n\ln\ln(1 + T) - \sum_{i=1}^{n}\ln\ln(1 + t_i)}$$

and

$$\hat{a} = \frac{n}{\ln^{\hat{b}}(1 + T)}$$

They can be calculated directly without numerical procedures.
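
Because the estimates are available in closed form, they take only a few lines to compute. The sketch below applies the formulas $\hat{b} = n / [n\ln\ln(1+T) - \sum\ln\ln(1+t_i)]$ and $\hat{a} = n/\ln^{\hat{b}}(1+T)$ to the failure times of Table 4.4; taking $T$ equal to the last failure time is an assumption made here, since the text does not state when testing stopped, and the function name is illustrative.

```python
import math

# Failure times from Table 4.4.
FAILURE_TIMES = [
    1.33, 3.43, 24.87, 58.15, 85.78, 145.84, 203.84, 205.82, 219.97, 244.09,
    288.43, 288.84, 303.02, 330.30, 375.16, 414.85, 417.96, 434.56, 517.28, 543.72,
    547.21, 554.65, 629.93, 741.44, 773.25, 789.56, 815.74, 874.62, 888.81, 924.94,
]


def fit_log_power(times, T):
    """Closed-form maximum likelihood estimates (a_hat, b_hat) of the log-power model."""
    n = len(times)
    lnln_T = math.log(math.log(1.0 + T))
    b_hat = n / (n * lnln_T - sum(math.log(math.log(1.0 + t)) for t in times))
    a_hat = n / math.log(1.0 + T) ** b_hat
    return a_hat, b_hat


# Assumption: the observation period ends at the last recorded failure.
a_hat, b_hat = fit_log_power(FAILURE_TIMES, T=FAILURE_TIMES[-1])
print(a_hat, b_hat)
```

By construction the fitted mean value function passes through the observed total, i.e. $\hat{a}\ln^{\hat{b}}(1+T) = n$.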



Musa-Okumoto model 



The model of Musa and Okumoto (1984) is another model for infinite failures. This NHPP
model is also called the logarithmic Poisson model. The mean value function is

$$m(t) = a\ln(1 + bt), \quad t > 0 \tag{4.54}$$

The failure intensity function is derived as

$$\lambda(t) = \frac{ab}{1 + bt}$$

Given a set of failure time data $\{t_i,\ i = 1, 2, \ldots, n\}$, the maximum likelihood
estimates of the parameters are the solutions of the following equations:

$$\hat{a} = \frac{n}{\ln(1 + \hat{b}t_n)}, \qquad \frac{1}{\hat{b}}\sum_{i=1}^{n}\frac{1}{1 + \hat{b}t_i} = \frac{n t_n}{(1 + \hat{b}t_n)\ln(1 + \hat{b}t_n)} \tag{4.55}$$

These equations have to be solved numerically.
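
One simple way to do this is bisection on the second equation of (4.55), followed by substitution into the first. The sketch below is a hedged illustration: the failure times are hypothetical, the bracket `[lo, hi]` is an assumption that must contain a sign change, and the function names are not from the text.

```python
import math


def mo_score(b, times):
    """Left side minus right side of the equation for b in (4.55)."""
    n, t_n = len(times), times[-1]
    lhs = sum(1.0 / (1.0 + b * t) for t in times) / b
    rhs = n * t_n / ((1.0 + b * t_n) * math.log1p(b * t_n))
    return lhs - rhs


def fit_musa_okumoto(times, lo=1e-9, hi=10.0, iterations=200):
    """Bisection for b, then a from the first equation of (4.55)."""
    f_lo = mo_score(lo, times)
    if f_lo * mo_score(hi, times) > 0:
        raise ValueError("no sign change in bracket; widen it or check the data")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = mo_score(mid, times)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    b_hat = 0.5 * (lo + hi)
    a_hat = len(times) / math.log1p(b_hat * times[-1])
    return a_hat, b_hat


# Hypothetical failure times showing reliability growth (gaps between failures widen).
times = [5.0, 12.0, 25.0, 45.0, 80.0, 130.0, 210.0, 330.0, 500.0, 760.0]
a_hat, b_hat = fit_musa_okumoto(times)
print(a_hat, b_hat)
```

A sign change exists when the failure times are concentrated in the first half of the observation period, which is the typical pattern under reliability growth.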
4.6. Notes and References

Software reliability is an important research area that has been studied by many 
researchers. Some books related to this are Musa et al. (1987), Xie (1991), Lyu 
(1996), Musa (1998) and Pham (2000). An earlier annotated bibliography can be
found in Xie (1993). In addition, Ammar et al. (2000) presented a brief
comparative survey of fault tolerance as it arises in hardware systems and 
software systems and discussed logical models as well as statistical models. 

Other than Markov models discussed in this chapter, Limnios (1997) 
analyzed the dependability of semi-Markov systems with finite state space based 
on algebraic calculus within a convolution algebra. Tokuno & Yamada (2001) 
constructed a Markov model, which related the failure and restoration 
characteristics of the software system with the cumulative number of corrected 
faults, and also considered the imperfect debugging process together with the 
time-dependent property. Goseva-Popstojanova & Trivedi (2003) presented an 
interesting study on some architecture-based approaches in software reliability. 
Becker et al. (2000) presented a semi-Markov model for software reliability 
allowing for inhomogeneities with respect to process time. Rajgopal & Mazumdar
(2002) also presented a Markov model for the transfer of control between 
different software modules. Boland & Singh (2003) also investigated a 
birth-process approach. 

For the NHPP models, Yamada & Osaki (1985) summarized some earlier 
software reliability growth models. Recently, many specific NHPP models have 
been studied. For example, Kuo et al. (2001) proposed a scheme for constructing 
software reliability growth models based on a NHPP model. Huang et al. (2003) 
further described how several existing software reliability growth models based 
on NHPP can be comprehensively derived by applying the concept of weighted 
arithmetic, weighted geometric, or weighted harmonic mean. Huang & Kuo 
(2003) presented some analysis that incorporates logistic testing-effort function 
into software reliability modeling. Zhang & Pham (2002) studied the problem of
predicting operational software availability for telecommunication systems.
Shyur (2003) also presented an NHPP model that considers both imperfect
debugging and the change-point. Pham (2003) recently presented studies in
software reliability that include NHPP software reliability models, NHPP
models with environmental factors, and cost models. See Pham & Zhang (2003)
for further discussion of reliability and cost models with testing
coverage.

Although the Markov and NHPP models are widely used in software 
reliability, some other models and tools might be also useful. Miller (1986) 
introduced “Order Statistic” models in studying the software reliability, which 
can also be found in the later research of Kaufman (1996), Aki & Hirano (1996), 
among others. Xie et al. (1998) described a double exponential smoothing
technique to predict software failures. Helander et al. (1998) presented planning 
models for distributing development effort among software components to 
facilitate cost-effective progress toward a system reliability goal. Recently, 
Zequeira (2000), Sahinoglu et al. (2001), Littlewood et al. (2003) and Ozekici & 
Soyer (2003), among others, studied some Bayesian approaches to model and 
estimate the reliability of software-based systems.



MODELS FOR INTEGRATED SYSTEMS



A computing system usually integrates both software and hardware, and software 
cannot work without the support of hardware. Hence, computing system 
reliability should be studied by considering both software and hardware 
components. This chapter presents some models for the reliability analysis at the 
system level by incorporating both software and hardware failures. First, a single 
processor system is studied. Second, the case of modular system reliability is 
discussed. Following that, Markov models for clustered computing system are 
presented. Finally, a unified model that integrates NHPP software model into the 
Markov hardware model is shown. 



5.1. Single-Processor System

The simplest case for the integrated software and hardware system is to view it as
a single processor divided into two subsystems: software and hardware.
For such a system, Goel & Soenjoto (1981) presented one of
the first, but general, Markov models, which is described in this section.



5.1.1. Markov modeling

The assumptions of the model are as follows: 

1) Faults in the software subsystem are independent of each other and
each has a failure occurrence rate of $\lambda$.

2) Failures of the hardware subsystem are also independent and have a failure
occurrence rate of $\lambda_h$.

3) The time to remove a software fault, when there are $i$ such faults in the
system, follows an exponential distribution with parameter $\mu_i$.

4) The time to remove the cause of a hardware failure also follows an
exponential distribution with parameter $\mu_h$.

5) Failures and repairs of the hardware subsystem are independent of both
the failures and repairs of the software subsystem.

6) At most one software fault is removed and no new software faults are
introduced during the fault correction stage.

7) When the system is not operational due to the occurrence of a software
failure, the fault causing the failure is corrected with probability $p_s$, and
$q_s = 1 - p_s$ is the probability of imperfect repair of software.

8) After the occurrence of a hardware failure, the hardware subsystem is
recovered with probability $p_h$, and $q_h = 1 - p_h$ is the probability that the
hardware still stays in the failed state after the repair.

Let $X(t)$ denote the state of the system at time $t$. The event '$X(t) = i$', $i = 0, 1, \ldots, N$,
implies that the system is operational while there are $i$ remaining software faults.
Here $N$ is the initial number of software faults. Also, '$X(t) = i_s$',
$i_s = 1_s, 2_s, \ldots, N_s$, implies that the system is down for repair of software with $i$
remaining software faults at the time of failure. Similarly, '$X(t) = i_h$',
$i_h = 1_h, 2_h, \ldots, N_h$, implies that the system is down for repair of hardware with $i$
remaining software faults at the time of failure. The Markov chain is shown in
Fig. 5.1.

Fig. 5.1. Markov chain for the transitions between states of $X(t)$.



Suppose that the system is at state $i$ (an operational state containing $i$ software
faults), $i = 1, 2, \ldots, N$. The system may fail due to a software failure, moving with
probability $p_i$ to state $i_s$, or due to a hardware failure, moving with probability
$q_i$ to state $i_h$. At state $i_s$, debugging takes place to remove the fault that
caused the software failure. With probability $p_s$, the software fault is
successfully removed and the system goes to state $i - 1$. Otherwise, with probability
$q_s$, the fault is not removed and the software is only restarted at state $i$. For state
$i_h$, maintenance personnel will try to recover from the hardware failure, and the system has
probability $p_h$ of returning to the operational state $i$ and probability $q_h$ of remaining
at the failure state $i_h$. After the software is fault-free, i.e. at state 0, the
system reduces to a hardware system subject to hardware failures only.

Let $Q_{kj}(t)$ be the one-step transition probability that, after transiting into
state $k$, the process $X(t)$ next transits to state $j$ in an amount of time less than or
equal to $t$. Denote by $F_{kj}(t)$ the cumulative distribution function of the time
from state $k$ to state $j$. Then, $Q_{kj}(t)$ is the product of $P_{kj}$ and $F_{kj}(t)$. The
expressions for $Q_{kj}(t)$ in Fig. 5.1 are as follows:

$$\begin{aligned}
Q_{i,i_s}(t) &= p_i[1 - \exp\{-(\lambda_i + \lambda_h)t\}] \\
Q_{i,i_h}(t) &= q_i[1 - \exp\{-(\lambda_i + \lambda_h)t\}] \\
Q_{i_s,i}(t) &= q_s[1 - \exp(-\mu_i t)] \\
Q_{i_s,i-1}(t) &= p_s[1 - \exp(-\mu_i t)] \\
Q_{i_h,i}(t) &= p_h[1 - \exp(-\mu_h t)] \\
Q_{i_h,i_h}(t) &= q_h[1 - \exp(-\mu_h t)]
\end{aligned} \tag{5.1}$$



These basic equations describe the stochastic process as a semi-Markov process 
and can be used to derive some system-performance measures, see e.g. Goel & 
Soenjoto (1981), such as time to a specified number of software faults, system 
operational probabilities, system reliability and availability, and expected number 
of software, hardware and total failures by time t. Some of the issues are 
discussed in the following. 



5.1.2. Time to a specified number of remaining software faults 

The faults remaining in the software are sources of failures and we would like to 
remove them as soon as possible. However, it is not always feasible or practical 
to remove all of the faults during a limited time period of testing. In that case, we 
would like to know the distribution of time to a specified purity level, i.e., of the 
time to n (0 <n<N) remaining faults. 

Let $T_{i,n}$ be the first passage time from state $i$ to $n$, and let $G_{i,n}(t)$ be its
distribution function. Consider a time interval $(\tau, \tau + d\tau)$. For any $i$, the
probability of going from state $i$ to $i_s$ in this interval is $dQ_{i,i_s}(\tau)$, and the
probability of going from state $i$ to $i_h$ is $dQ_{i,i_h}(\tau)$.

After the process $X(t)$ reaches either state $i_s$ or $i_h$, further transitions will
be governed by the distribution functions $G_{i_s,n}$ and $G_{i_h,n}$, respectively. $G_{N,n}(t)$
can be obtained by taking the Laplace-Stieltjes transform of the renewal equation
for $G_{i,n}(t)$, $i = n + 1, \ldots, N$, as shown in Goel & Soenjoto (1981):

$$G_{N,n}(t) = \beta_{n+1,N} \sum_{j=1}^{3(N-n)} \frac{\alpha_{j,N-n}}{\Theta_{3N-3n,j}} \cdot \frac{1 - \exp(-x_j t)}{x_j} \tag{5.2}$$

where

$$\beta_{n+1,N} = \prod_{i=n+1}^{N} (\lambda_i p_s \mu_i), \qquad \alpha_{j,N-n} = (-x_j + p_h\mu_h)^{N-n}, \qquad \Theta_{m,j} = \prod_{l=1,\, l \ne j}^{m} (-x_j + x_l)$$

Then, it can be shown that

$$E[T_{N,n}] = \int_0^{\infty} t\, dG_{N,n}(t) = \beta_{n+1,N} \sum_{j=1}^{3(N-n)} \frac{\alpha_{j,N-n}}{\Theta_{3N-3n,j}\, x_j^{2}} \tag{5.3}$$

$$E[T_{N,n}^{2}] = 2\beta_{n+1,N} \sum_{j=1}^{3(N-n)} \frac{\alpha_{j,N-n}}{\Theta_{3N-3n,j}\, x_j^{3}} \tag{5.4}$$

and

$$\mathrm{Var}[T_{N,n}] = E[T_{N,n}^{2}] - E^{2}[T_{N,n}] \tag{5.5}$$



Example 5.1. Consider a system with $N = 10$ faults, $p_s = 0.9$ and $p_h = 0.9$.
Suppose that $\lambda_i = i\lambda$, $\mu_i = i\mu$, and the parametric values are $\lambda = 0.02$,
$\mu = 0.05$, $\lambda_h = 0.01$, $\mu_h = 0.025$.

Substituting the numerical values into Eq. (5.2), we obtain the distribution of
$T_{N,n}$. The distribution of $T_{N,2}$ is shown in Fig. 5.2, and the trends of the other
distributions are similar. The means and standard deviations of these
distributions are obtained from Eqs. (5.3)-(5.5), and some
numerical results are shown in Table 5.1.

Fig. 5.2. A sample distribution of $T_{N,2}$.



Table 5.1. Mean and standard deviation of first passage-time distribution.

To state     Mean     Std. dev.
    9        10.2       17.1
    8        21.6       25.0
    7        34.4       31.9
    6        49.1       38.4
    5        66.2       45.2
    4        86.7       47.7
    3       112.3       61.5
    2       146.4       72.9
    1       197.7       90.8
    0       300.1      133
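
The means in Table 5.1 can be cross-checked without evaluating Eq. (5.3), using a simple renewal argument that is our own sketch rather than the derivation of Goel & Soenjoto (1981). It assumes that software and hardware failures compete as independent exponentials, so that on average $\lambda_h/\lambda_i$ hardware failures occur per software failure, that a hardware outage lasts $1/(p_h\mu_h)$ on average because repair attempts are repeated until one succeeds, and that $1/p_s$ software failures are needed on average to remove one fault. The function names are illustrative.

```python
def mean_step(i, lam=0.02, mu=0.05, lam_h=0.01, mu_h=0.025, p_s=0.9, p_h=0.9):
    """Mean time to go from i to i-1 remaining software faults (parameters of Example 5.1)."""
    lam_i, mu_i = i * lam, i * mu
    up_time = 1.0 / lam_i                       # operating time accumulated per software failure
    hw_down = (lam_h / lam_i) / (p_h * mu_h)    # hardware outages per software failure x mean outage
    sw_repair = 1.0 / mu_i                      # one software repair attempt
    return (up_time + hw_down + sw_repair) / p_s


def mean_first_passage(N, n, **params):
    """Mean first passage time from N to n remaining faults."""
    return sum(mean_step(i, **params) for i in range(n + 1, N + 1))


means = {n: round(mean_first_passage(10, n), 1) for n in range(9, -1, -1)}
print(means)
```

Under these assumptions the rounded values reproduce the "Mean" column of Table 5.1; the standard deviations are not covered by this argument.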



5.1.3. System reliability and availability 

The system reliability, or the probability that the system is operational at time $t$
with a specified number of remaining software faults, can then be derived as
follows. Let $P_{N,n}(t)$ be the probability that the system is operational at time $t$
with $n$ remaining software faults, given that it was in operation at time $t = 0$
with $N$ software faults, i.e.,

$$P_{N,n}(t) = \Pr\{X(t) = n \mid X(0) = N\}, \quad n = 0, 1, \ldots, N \tag{5.6}$$

We call $P_{N,n}(t)$ the (operational) state occupancy probability. By conditioning
on the first up-down cycle of the process, as shown by Goel & Soenjoto (1981),
the following equation for $P_{n,n}(t)$ can be obtained:

$$P_{n,n}(t) = \exp\{-(\lambda_n + \lambda_h)t\} + Q_{n,n} \otimes P_{n,n}(t) \tag{5.7}$$

In the above, $\otimes$ is the convolution operator as in Eq. (2.35). By conditioning on
the first passage time, we have

$$P_{N,n}(t) = P_{n,n}(t) \otimes G_{N,n}(t) \tag{5.8}$$

where $G_{N,n}(t)$ is given by Eq. (5.2).

By taking the Laplace-Stieltjes transforms of the above equations and solving
the resulting equations, we have



3(Af-n)+3 

Pn , n(0 = G N „(0 ~ X 

7=1 



Aj,/y- n exp {-xjt) 

®3W-3n+3,j X j 



(5-9) 



The system availability can then be computed as

$$A(t) = \sum_{n=0}^{N} P_{N,n}(t) \tag{5.10}$$

An example of the operational probabilities and system availability is shown below.



Example 5.2. Continuing Example 5.1, the distributions $P_{10,n}(t)$ obtained
from Eq. (5.9) and the availability function $A(t)$ obtained from Eq. (5.10) are
shown in Table 5.2 for some different time points.

Table 5.2. Selected values of $P_{10,n}(t)$ and $A(t)$.



   n      t = 50    t = 100   t = 300   t = 500
  10        --       .001      .000      .000
   9        --        --       .000      .000
   8        --        --       .000      .000
   7        --       .012      .000      .000
   6       .149       --       .001      .000
   5       .166       --       .001      .000
   4       .106      .106      .004      .000
   3       .036      .159      .011      .001
   2       .006      .139       --       .003
   1       .000      .054       --       .033
   0       .000      .005       --       .645
  A(t)     .586      .561      .637      .682

(Entries marked -- are not legible in the source.)



5.1.4. Expected number of failures by time t

Expected number of software failures

Let $M_s(t)$ be the expected number of software failures detected by time $t$.
Consider a counting process $\{N_{s,i}(t), t \ge 0\}$, where $N_{s,i}(t)$ is the number of
software failures detected during the time interval $(0, t]$ when the initial number
of faults in the software system is $i$. Let $M_{s,i}(t) = E\{N_{s,i}(t) \mid X(0) = i\}$. By
conditioning on the first passage time going from state $N$ to $i$, we have

$$M_s(t) = M_{s,i}(t) \otimes G_{N,i}(t) \tag{5.11}$$

Using the Laplace-Stieltjes transforms as shown in Goel & Soenjoto (1981), we
get
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N 3(/V-/+l) A (- r ■ + «.) eXD (-X.t) 

«.<«=!>.» I < 512 > 



1=1 



;=1 Ps ' Mi '®3(N-M).j 



Expected number of hardware failures 

Let $M_h(t)$ be the expected number of hardware failures detected by time $t$.
Consider a counting process $\{N_{h,i}(t), t \ge 0\}$, where $N_{h,i}(t)$ is the number of
hardware failures detected during the time interval $(0, t]$ when the initial number
of faults in the software subsystem is $i$. Let $M_{h,i}(t) = E\{N_{h,i}(t) \mid X(0) = i\}$. By
conditioning on the first passage time from state $N$ to $i$, we have

$$M_h(t) = M_{h,i}(t) \otimes G_{N,i}(t) \tag{5.13}$$

Using the Laplace-Stieltjes transforms, we get 

M h (t) = f d G,(t) (5.14) 

1=0 

The expected total number of failures, denoted by $M(t)$, is the sum of the
expected numbers of software failures and hardware failures:

$$M(t) = M_s(t) + M_h(t) \tag{5.15}$$



Example 5.3. Consider the same example as in Examples 5.1 and 5.2. For this
system, the expected numbers of software, hardware, and system failures are
computed from the above equations. Some numerical values are given in Table
5.3.

Table 5.3 shows that the number of software failures detected increases
rapidly at the beginning, leveling off at a value of about 11 at $t = 500$. This
limit agrees with $N/p_s = 10/0.9 \approx 11.1$, the expected number of repair attempts
needed to remove all ten faults when each attempt succeeds with probability $p_s$. The
leveling off happens because the software failure rate depends on the number of remaining
faults and this number decreases with time. After $t = 800$, there are no software
faults left and the system is composed of a perfect software subsystem and a
failure-prone hardware subsystem. The rate of occurrence of hardware failures,
on the other hand, is unaffected by the passage of time.



Table 5.3. Expected cumulative number of failures detected.

 Time    Software failures    Hardware failures    Total failures
    0           0                    0                    0
   20           2.48                 0.14                 2.62
   40           4.19                 0.26                 4.45
   60           5.47                 0.38                 5.85
   80           6.47                 0.49                 6.96
  100           7.27                 0.61                 7.87
  200           9.59                 1.17                10.77
  400          10.88                 2.44                13.32
  600          11.08                 3.80                14.88
  800          11.11                 5.18                16.29
 1000          11.11                 6.57                17.68



5.2. Models for Modular System 

Similar to the case of modular software presented in the previous chapter, 
integrated software and hardware systems can also be decomposed into a finite 
number of modules. Markov models can also be used in analyzing such modular 
systems as shown below. 



5.2.1. Markov modeling 

Siegrist (1988) might be one of the first models using Markov processes to 
analyze the modular software/hardware systems. It was assumed that the control 
of the system is transferred among the modules according to a Markov process. 
Each module has an associated reliability which gives the probability that the 
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module will operate correctly when called and will transfer control successfully 
when finished. The system will eventually either fail or complete its task 
successfully so that to enter a terminal state. 

The modules (or states) of the system are denoted by $i$ ($i = 1, 2, \dots, n$). 
The ideal (failure-free) system is described by a Markov chain with state space 
$\{1, 2, \dots, n\}$ and transition matrix $P$. That is, $P_{ij}$ is the 
conditional probability that the next state will be $j$ given that the current 
state is $i$. The reliability of state $i$, denoted by $R_i$, is the probability 
that state $i$ will function correctly when called and will transfer control 
successfully when finished. The imperfect system is modeled by adding an 
absorbing state $F$ (failure state) and modifying the transition matrix 
accordingly. 

Specifically, the imperfect system is described by a Markov chain with state 
space $\{1, 2, \dots, n, F\}$ and transition matrix $\hat{P}$ given by 

$$
\begin{aligned}
\hat{P}_{ij} &= R_i P_{ij}, && i, j = 1, \dots, n \\
\hat{P}_{iF} &= 1 - R_i, && i = 1, \dots, n
\end{aligned}
\tag{5.16}
$$

Usually $R_i < 1$ for each $i$, and hence each of the states $1, 2, \dots, n$ 
eventually leads to the absorbing state $F$. Note that the dynamics of the 
imperfect system are completely described by the state reliability function $R$ 
and the transition matrix $P$ of the ideal system, since this description is 
equivalent to specifying the transition matrix $\hat{P}$ of the imperfect system. 



5.2.2. Expected number of transitions until failure 

Based on the above Markov model, Siegrist (1988) used the expected number of 
transitions until failure as the measure of system reliability. Let $M_i$ 
denote the expected number of transitions until failure for the imperfect system, 
starting in state $i$. If the transitions of the system correspond to inputs 
received at regular time intervals, then $M_i$ is proportional to the expected 
time until failure, starting in state $i$. Two methods of computing the function 
$M$ will be given. 

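Let $Q$ denote the restriction of the transition matrix $\hat{P}$ of the 
imperfect system to the (transient) states $1, 2, \dots, n$. Note that 
$Q_{ij} = R_i P_{ij}$. Then 

$$\sum_{k=0}^{\infty} Q^k = (I - Q)^{-1} \tag{5.17}$$

It follows that 

$$M = (I - Q)^{-1} \mathbf{1} \tag{5.18}$$

where $\mathbf{1}$ is the column vector of ones, that is, 
$M_i = \sum_{j=1}^{n} [(I - Q)^{-1}]_{ij}$. 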

Let $i$ and $j$ be any of the states $1, 2, \dots, n$. We have that 

$$M_i = A_{ij} + B_{ij} M_j \tag{5.19}$$

where $A_{ij}$ is the expected number of transitions until the imperfect system 
either fails or reaches state $j$, starting in state $i$; and $B_{ij}$ is the 
probability that the imperfect system eventually reaches state $j$, starting in 
state $i$. If $i = j$, "reaches" should be interpreted as "returns to", in which 
case we obtain from the above equation 

$$M_j = \frac{A_{jj}}{1 - B_{jj}} \tag{5.20}$$

Then, the desired result is 

$$M_i = A_{ij} + \frac{A_{jj} B_{ij}}{1 - B_{jj}} \tag{5.21}$$



From the Markov property, the matrices $A$ and $B$ are related to the basic data 
$R$ and $P$ according to the following systems of equations: 

$$A_{ij} = 1 + \sum_{k \neq j} R_i P_{ik} A_{kj} \tag{5.22}$$

and 

$$B_{ij} = R_i P_{ij} + \sum_{k \neq j} R_i P_{ik} B_{kj} \tag{5.23}$$

Moreover, for $i \neq j$, $B_{ij} / (1 - B_{jj})$ is the same as 
$[(I - Q)^{-1}]_{ij}$, namely, the expected number of visits to $j$ for the 
imperfect system starting in state $i$. 

With the Markov property, the measure of expected number of transitions 
until failure is derived. As a result, this model is more appropriate for systems 
which run for fixed periods of time or which run continuously (until failure). Two 
examples of branching and sequential structures are illustrated here. 



Example 5.4. (A Branching System) A general branching system has the 
transition graph depicted in Fig. 5.3. State 1 acts as a central control which may 
pass the control to any of the states $2, \dots, n$ or back to itself. Each of the 
branch states $2, \dots, n$ can pass control back to itself or back to the center 
state 1. 




Fig. 5.3. The transition graph for a general branching system. 
Given the transition matrix $P$ and the state reliability function $R$, and with 
state 1 as the initial state, $M_1$, the expected number of transitions until 
failure starting in state 1, can be computed. Note first that the imperfect 
system, starting in state 1, will make at least one transition before failure or 
return to state 1 occurs. Furthermore, if the system moves to state $j$ on the 
first transition, then on average it will make $1/(1 - R_j P_{jj})$ transitions 
until failure or return to state 1 occurs. It follows that 



$$A_{11} = 1 + \sum_{j=2}^{n} \frac{R_1 P_{1j}}{1 - R_j P_{jj}} \tag{5.24}$$



On the other hand, the probability that the imperfect system, starting from state 1, 
will eventually return to state 1 is 

$$B_{11} = R_1 P_{11} + \sum_{j=2}^{n} \frac{R_1 P_{1j} R_j P_{j1}}{1 - R_j P_{jj}} \tag{5.25}$$



Therefore, from Eq. (5.20) 

$$M_1 = \frac{A_{11}}{1 - B_{11}} \tag{5.26}$$



Suppose that $n = 3$ and the modules are a CPU, a memory and a computing 
software module. The CPU is the central state (state 1), from which every 
computation starts, and the other two modules are branch states that exchange 
control only with the CPU and with themselves. Given the transition matrix 

$$
P = \begin{pmatrix}
0.1 & 0.3 & 0.6 \\
0.7 & 0.3 & 0 \\
0.8 & 0 & 0.2
\end{pmatrix}
$$

and the reliabilities $R_1 = 0.95$, $R_2 = 0.9$, $R_3 = 0.85$, substituting the 
numerical values into the above equations gives 

$$A_{11} = 2.077 \quad \text{and} \quad B_{11} = 0.8079$$

The expected number of transitions until failure is 

$$M_1 = 10.815$$

Given that the expected time of each transition is 26 seconds, the MTTF is 

$$\text{MTTF} = 10.815 \times 26 = 281.2 \text{ (seconds)}$$
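The two ways of computing $M_1$ in Example 5.4 can be cross-checked numerically. The following is a minimal sketch in Python, assuming only the standard library; the helper `solve_linear` and all variable names are illustrative and not part of the original model. It solves $(I - Q) M = \mathbf{1}$ directly and compares the result with the branching formulas for $A_{11}$, $B_{11}$ and $M_1 = A_{11}/(1 - B_{11})$.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (illustrative helper)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


# Data of Example 5.4 (index 0 is state 1, the central CPU state).
P = [[0.1, 0.3, 0.6],
     [0.7, 0.3, 0.0],
     [0.8, 0.0, 0.2]]
R = [0.95, 0.9, 0.85]
n = len(R)

# Method 1: M = (I - Q)^{-1} 1 with Q_ij = R_i * P_ij.
Q = [[R[i] * P[i][j] for j in range(n)] for i in range(n)]
I_minus_Q = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)]
             for i in range(n)]
M = solve_linear(I_minus_Q, [1.0] * n)

# Method 2: closed-form A_11 and B_11 for the branching structure.
A11 = 1.0 + sum(R[0] * P[0][j] / (1.0 - R[j] * P[j][j]) for j in range(1, n))
B11 = R[0] * P[0][0] + sum(
    R[0] * P[0][j] * R[j] * P[j][0] / (1.0 - R[j] * P[j][j])
    for j in range(1, n))
M1 = A11 / (1.0 - B11)

# MTTF with 26 seconds per transition, as assumed in the example.
mttf = M1 * 26.0
print(round(A11, 3), round(B11, 4), round(M1, 3), round(mttf, 1))
```

Both methods should agree to within rounding, which gives a quick sanity check on any hand-derived $A$ and $B$ for other structures.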



Example 5.5. (A Sequential System) The transition graph of a sequential system 
is given in Fig. 5.4. Note that control tends to pass sequentially from state 1 to 
state 2, and so on up to state $n$, except that in each state control can return 
to that state or to state 1, which is the initial state. 




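Suppose that the transition matrix $P$ and the state reliability function $R$ are 
known. First note that when the system is in state $i$, the expected number of 
transitions until the process leaves state $i$ is $1/(1 - R_i P_{ii})$. It follows 
that 

$$
A_{11} = 1 + \frac{R_1 P_{12}}{1 - R_2 P_{22}}
+ \sum_{i=3}^{n} \frac{R_1 P_{12}}{1 - R_i P_{ii}}
\prod_{k=2}^{i-1} \frac{R_k P_{k,k+1}}{1 - R_k P_{kk}}
\tag{5.27}
$$

By a similar argument, the probability of eventual return to state 1, starting in 
state 1, for the imperfect system is 

$$
B_{11} = R_1 P_{11} + \frac{R_1 P_{12} R_2 P_{21}}{1 - R_2 P_{22}}
+ \sum_{i=3}^{n} \frac{R_1 P_{12} R_i P_{i1}}{1 - R_i P_{ii}}
\prod_{k=2}^{i-1} \frac{R_k P_{k,k+1}}{1 - R_k P_{kk}}
\tag{5.28}
$$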
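As a check on the sequential-system formulas, the sketch below evaluates $A_{11}$ and $B_{11}$ by accumulating the product terms state by state, and compares $M_1 = A_{11}/(1 - B_{11})$ with a truncated version of the series $\sum_k Q^k \mathbf{1}$ from Eq. (5.17). The 4-state transition matrix and the reliabilities are hypothetical values chosen only for illustration; they do not come from the text.

```python
def sequential_a11_b11(P, R):
    """A_11 and B_11 for a sequential system in which state i can move
    only to itself, to state i+1, or back to state 1 (index 0)."""
    n = len(R)
    a11 = 1.0
    b11 = R[0] * P[0][0]
    # Probability of entering the next state before failure or return to state 1.
    reach = R[0] * P[0][1]
    for i in range(1, n):
        leave = 1.0 - R[i] * P[i][i]
        a11 += reach / leave
        b11 += reach * R[i] * P[i][0] / leave
        if i + 1 < n:
            reach *= R[i] * P[i][i + 1] / leave
    return a11, b11


def neumann_series_m(P, R, terms=5000):
    """Approximate M = sum_k Q^k 1 by iterating m <- 1 + Q m."""
    n = len(R)
    m = [0.0] * n
    for _ in range(terms):
        m = [1.0 + sum(R[i] * P[i][j] * m[j] for j in range(n))
             for i in range(n)]
    return m


# Hypothetical sequential system with four states.
P = [[0.2, 0.8, 0.0, 0.0],
     [0.1, 0.2, 0.7, 0.0],
     [0.1, 0.0, 0.2, 0.7],
     [0.6, 0.0, 0.0, 0.4]]
R = [0.99, 0.95, 0.90, 0.97]

a11, b11 = sequential_a11_b11(P, R)
m1_closed = a11 / (1.0 - b11)
m1_series = neumann_series_m(P, R)[0]
print(m1_closed, m1_series)
```

The truncated series converges because every row sum of $Q$ equals $R_i < 1$; the number of terms is a convenience choice rather than a tuned value.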



5.3. Models for Clustered System 

Clustered computing systems use commercially available computers networked 
in a loosely-coupled fashion. They can provide high levels of reliability if 
appropriate levels of fault detection and recovery software are implemented in the 
middleware (an application layer). The application, therefore, can be made as 
reliable as the user requires, constrained only by the upper bounds on 
reliability imposed by the architecture, performance and cost considerations. 



5.3.1. Introduction to clustered computing systems 

A cluster is a collection of computers in which any member of the cluster is 
capable of supporting the processing functions of any other member. A clustered 
computing system has a redundant n + k configuration, where n processing 
nodes are actively processing the application and k processing nodes are in a 
standby state, serving as spares. In the event of a failure of an active node, the 
application that was running on the failed node is moved to one of the standby 
nodes. 

The simplest cluster system has one active node and one standby node: one node 
is actively processing the application and the other node is in a standby state. 
Other common cluster systems include simplex (one active node, no spare), $n+1$ 
($n$ active nodes, 1 spare), and $n+0$ (all $n$ nodes active). In an $n+0$ system, 
the applications from the failed node are redistributed among the other 
active nodes using a pre-specified algorithm. 
Consider a general clustered computing system with $n$ active processors and 
$k$ spares, see e.g., Mendiratta (1998). In this system, there is a Power Dog (PD) 
attached to each processor that can power cycle or power down the processor, and 
a Watch Dog (WD) with connections to each processor that monitors the 
performance of each processor and initiates failover if it detects a processor 
failure. Then, the failover information is transferred to a switching system (SS) 
that can turn on the Power Dog of the standby processors to replace the failed 
ones. 

The block diagram for this clustered system architecture is shown in Fig. 5.5 
and represents the system to be modeled. 



(Diagram labels: Active Processors; Standby Processors.)









Fig. 5.5. A general architecture of n + k clustered computing systems. 



5.3.2. Markov modeling 

For each processor, there are two types of failures: software and hardware 
failures. Suppose the failure rate is $\lambda_s$ for software and $\lambda_h$ for 
hardware. The failed processors may or may not be repaired; the two cases are 
discussed in turn below. 



Model for non-repairable system 

In a non-repairable system, the processors are not repaired after they fail. 
Thus, for the $n+k$ clustered system without repair, the Markov model 
can be depicted by the CTMC in Fig. 5.6. 




Fig. 5.6. CTMC of n + k clustered system without repair. 



The state $i$ in Fig. 5.6 represents the number of good processors (both active 
and standby). If $i \ge n$, the cluster system keeps $n$ processors active, so the 
failure rate is $n(\lambda_s + \lambda_h)$. If $0 < i < n$, no spares are 
available and the number of active processors is $i$. Hence, the failure 
occurrence rate is $i(\lambda_s + \lambda_h)$. 

Denote by $P_i(t)$ ($i = 0, 1, 2, \dots, n+k$) the probability that the system is 
in state $i$ at time $t$. The Chapman-Kolmogorov equations can be written as 

$$
\begin{aligned}
P'_{n+k}(t) &= -n(\lambda_s + \lambda_h) P_{n+k}(t) \\
P'_i(t) &= n(\lambda_s + \lambda_h) P_{i+1}(t) - n(\lambda_s + \lambda_h) P_i(t), && i = n, n+1, \dots, n+k-1 \\
P'_i(t) &= (i+1)(\lambda_s + \lambda_h) P_{i+1}(t) - i(\lambda_s + \lambda_h) P_i(t), && i = 1, 2, \dots, n-1 \\
P'_0(t) &= (\lambda_s + \lambda_h) P_1(t)
\end{aligned}
\tag{5.29}
$$
We assume that the process begins in state $n+k$, that is, all the processors are 
good initially. Hence, the initial conditions are 

$$P_{n+k}(0) = 1, \quad P_{n+k-1}(0) = P_{n+k-2}(0) = \dots = P_0(0) = 0 \tag{5.30}$$

With a numerical program, one can obtain the solution of the above differential 
equations with these initial conditions even for large values of $n+k$. 

The probability of the system failure state, $P_0(t)$, determines the 
unreliability function. Therefore, the reliability function, defined as the 
probability that at least one processor works, is 

$$R(t) = 1 - P_0(t) \tag{5.31}$$
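The numerical solution mentioned above can be sketched with a fixed-step fourth-order Runge-Kutta integrator. This is a minimal illustration using only the Python standard library, not the program used by the authors; the function names, step count and the parameter values in the checks are assumptions. For a 2+0 cluster the chain is a pure death process whose failed-state probability has the closed form $[1 - \exp(-\lambda t)]^2$ with $\lambda = \lambda_s + \lambda_h$, which is used as a reference. The sketch also evaluates the first term of the power series of $P_0(t)$ about $t = 0$ and the size of the next term.

```python
import math


def death_rates(n, k, lam):
    """rates[i] is the total failure rate out of state i (i good processors)."""
    return [min(i, n) * lam for i in range(n + k + 1)]


def derivative(p, rates):
    """Right-hand side of the Chapman-Kolmogorov equations without repair."""
    top = len(p) - 1
    dp = [0.0] * len(p)
    for i in range(top + 1):
        dp[i] -= rates[i] * p[i]
        if i < top:
            dp[i] += rates[i + 1] * p[i + 1]
    return dp


def rk4(p, rates, t_end, steps):
    h = t_end / steps
    for _ in range(steps):
        k1 = derivative(p, rates)
        k2 = derivative([x + 0.5 * h * d for x, d in zip(p, k1)], rates)
        k3 = derivative([x + 0.5 * h * d for x, d in zip(p, k2)], rates)
        k4 = derivative([x + h * d for x, d in zip(p, k3)], rates)
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


def reliability(n, k, lam, t, steps=2000):
    """R(t) = 1 - P_0(t), starting with all n + k processors good."""
    p0 = [0.0] * (n + k) + [1.0]
    return 1.0 - rk4(p0, death_rates(n, k, lam), t, steps)[0]


def first_term_approx(n, k, lam, t):
    """First-term approximation of R(t) and the magnitude of the next series term."""
    rates = death_rates(n, k, lam)[1:]
    prod = math.prod(rates)
    m = n + k
    approx = 1.0 - t ** m / math.factorial(m) * prod
    bound = t ** (m + 1) / math.factorial(m + 1) * prod * sum(rates)
    return approx, bound


lam = 0.003 + 0.0012  # lambda_s + lambda_h, illustrative values
r_num = reliability(2, 0, lam, 100.0)
r_closed = 1.0 - (1.0 - math.exp(-lam * 100.0)) ** 2

r_exact_small_t = reliability(2, 1, lam, 10.0)
r_approx, err_bound = first_term_approx(2, 1, lam, 10.0)
print(r_num, r_closed, r_exact_small_t, r_approx, err_bound)
```

The first-term approximation is only meaningful for small $t$, where successive series terms decrease in magnitude.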

Moreover, we can use the Laplace-Stieltjes transform to approximate the 
reliability function. For example, the state probability for the failed state 
after the transformation is 

$$
P_0(s) = \frac{\prod_{i=1}^{n+k} \lambda_i}{s (s + \lambda_1)(s + \lambda_2) \cdots (s + \lambda_{n+k})}
\tag{5.32}
$$

where 

$$
\lambda_i = \begin{cases}
n(\lambda_s + \lambda_h) & \text{if } i \ge n \\
i(\lambda_s + \lambda_h) & \text{if } i < n
\end{cases}
$$

Expanding the denominator and substituting the expression into the equation for 
$P_0(s)$, we obtain: 

$$
P_0(s) = \frac{1}{s^{n+k+1}} \prod_{i=1}^{n+k} \lambda_i
- \frac{1}{s^{n+k+2}} \prod_{i=1}^{n+k} \lambda_i \sum_{i=1}^{n+k} \lambda_i + \dots
\tag{5.33}
$$

The above equation can be easily inverted term by term using inverse 
Laplace-Stieltjes transforms: 

$$
P_0(t) = \frac{t^{n+k}}{(n+k)!} \prod_{i=1}^{n+k} \lambda_i
- \frac{t^{n+k+1}}{(n+k+1)!} \prod_{i=1}^{n+k} \lambda_i \sum_{i=1}^{n+k} \lambda_i + \dots
\tag{5.34}
$$

Generally, we use only the first term for the approximation and we have 

$$R(t) \approx 1 - \frac{t^{n+k}}{(n+k)!} \prod_{i=1}^{n+k} \lambda_i \tag{5.35}$$

Since we have an alternating power series, the next term provides a bound on 
the absolute error in using this approximation: 

$$
|\text{Error}| \le \frac{t^{n+k+1}}{(n+k+1)!} \prod_{i=1}^{n+k} \lambda_i \sum_{i=1}^{n+k} \lambda_i
$$

Model for repairable system 

If a system is repairable, the failed processor can be recovered with a repair 
rate $\mu_i$ from state $i-1$ back to state $i$. The Markov model is built as the 
CTMC of Fig. 5.7. 




Fig. 5.7. CTMC for repairable clustered systems. 



As before, the Chapman-Kolmogorov equations can be written as 

$$
\begin{aligned}
P'_{n+k}(t) &= \mu_{n+k} P_{n+k-1}(t) - n(\lambda_s + \lambda_h) P_{n+k}(t) \\
P'_i(t) &= n(\lambda_s + \lambda_h) P_{i+1}(t) + \mu_i P_{i-1}(t) - [n(\lambda_s + \lambda_h) + \mu_{i+1}] P_i(t), && i = n, \dots, n+k-1 \\
P'_i(t) &= (i+1)(\lambda_s + \lambda_h) P_{i+1}(t) + \mu_i P_{i-1}(t) - [i(\lambda_s + \lambda_h) + \mu_{i+1}] P_i(t), && i = 1, \dots, n-1 \\
P'_0(t) &= (\lambda_s + \lambda_h) P_1(t) - \mu_1 P_0(t)
\end{aligned}
\tag{5.36}
$$

with the initial conditions (5.30). 

Again, these equations can be solved numerically with a computer program. 



Model with different repair rates of software and hardware 

In the above model for repairable clusters, $\mu$ is the expected system repair 
rate regardless of whether the processor failure was caused by software or by 
hardware. In practice, the rate for repairing a software failure should be 
different from that for repairing a hardware failure (Lai et al., 2002). A model 
with different repair rates is discussed here. 

Let $\mu_s$ be the rate to repair one failed processor whose failure was caused 
by software and $\mu_h$ the corresponding rate for hardware. Then part of the 
CTMC can be depicted as shown in Fig. 5.8. 



In Fig. 5.8, the transition rates are given by 

$$
\lambda_s(i, j) = \begin{cases}
n \lambda_s & \text{if } i + j \le k \\
(n + k - i - j) \lambda_s & \text{if } i + j > k
\end{cases}
$$

and 

$$
\lambda_h(i, j) = \begin{cases}
n \lambda_h & \text{if } i + j \le k \\
(n + k - i - j) \lambda_h & \text{if } i + j > k
\end{cases}
$$

State $(i, j)$: $i$ hardware down, $j$ software down (on different 
processors) 

Fig. 5.8. CTMC for repairable cluster with different software/hardware 
repair rates. 



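The corresponding Chapman-Kolmogorov differential equation for the 
probability that the system is in state $(i, j)$ at time $t$ is, for 
$i, j \ge 1$ with $i + j \le n + k - 1$, 

$$
\begin{aligned}
P'_{i,j}(t) = {} & \mu_h P_{i+1,j}(t) + \lambda_h(i-1, j) P_{i-1,j}(t) + \lambda_s(i, j-1) P_{i,j-1}(t) \\
& + \mu_s P_{i,j+1}(t) - [\mu_h + \mu_s + \lambda_h(i, j) + \lambda_s(i, j)] P_{i,j}(t)
\end{aligned}
\tag{5.37}
$$

The initial conditions are 

$$P_{0,0}(0) = 1 \quad \text{and} \quad P_{i,j}(0) = 0 \text{ for } (i, j) \neq (0, 0) \tag{5.38}$$

The boundary conditions are: 

$$
\begin{aligned}
P'_{0,0}(t) &= \mu_h P_{1,0}(t) + \mu_s P_{0,1}(t) - n(\lambda_s + \lambda_h) P_{0,0}(t) \\
P'_{0,j}(t) &= \mu_h P_{1,j}(t) + \lambda_s(0, j-1) P_{0,j-1}(t) + \mu_s P_{0,j+1}(t) \\
&\quad - [\mu_s + \lambda_h(0, j) + \lambda_s(0, j)] P_{0,j}(t), && j = 1, 2, \dots, n+k-1 \\
P'_{i,0}(t) &= \mu_h P_{i+1,0}(t) + \lambda_h(i-1, 0) P_{i-1,0}(t) + \mu_s P_{i,1}(t) \\
&\quad - [\mu_h + \lambda_h(i, 0) + \lambda_s(i, 0)] P_{i,0}(t), && i = 1, 2, \dots, n+k-1 \\
P'_{i,j}(t) &= \lambda_h(i-1, j) P_{i-1,j}(t) + \lambda_s(i, j-1) P_{i,j-1}(t) \\
&\quad - (\mu_s + \mu_h) P_{i,j}(t), && i + j = n + k,\ 0 < i, j < n + k \\
P'_{n+k,0}(t) &= \lambda_h(n+k-1, 0) P_{n+k-1,0}(t) - \mu_h P_{n+k,0}(t) \\
P'_{0,n+k}(t) &= \lambda_s(0, n+k-1) P_{0,n+k-1}(t) - \mu_s P_{0,n+k}(t)
\end{aligned}
\tag{5.39}
$$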

The above equations need to be solved numerically with a computer program. 
After that, the system availability for the $n+k$ clustered system can be 
calculated by 

$$A(t) = \sum_{i + j < n + k} P_{i,j}(t) \tag{5.40}$$



Example 5.6. Consider a clustered system containing 2 active processors. 
Suppose that the failure rate of software is $\lambda_s = 0.003$ and that of 
hardware is $\lambda_h = 0.0012$ in a processor. We discuss three different 
cases below. 

1) The case without repair 

In this case, the Markov model for this 2+0 cluster is shown in Fig. 5.9, where 
the transition rate from state 2 to state 1 is $2(\lambda_s + \lambda_h)$ and that 
from state 1 to state 0 is $\lambda_s + \lambda_h$. 

Fig. 5.9. CTMC without considering repair. 

Solving the Chapman-Kolmogorov differential equations in (5.29) with initial 
condition (5.30), we get 

$$P_0(t) = [1 - \exp\{-(\lambda_s + \lambda_h) t\}]^2$$

and the reliability function 

$$R(t) = 1 - P_0(t) = 1 - [1 - \exp(-0.0042 t)]^2$$

2) The case of the identical repair rate 

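With the identical repair rate $\mu$, the Markov model for this 2+0 cluster is 
shown in Fig. 5.10. The failure transitions have the rates 
$2(\lambda_s + \lambda_h)$ and $\lambda_s + \lambda_h$, as in the case without 
repair. 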




Fig. 5.10. CTMC for Example 5.6 considering identical repair rate. 
By constructing the Chapman-Kolmogorov differential equations, we get 

$$
\begin{aligned}
P'_2(t) &= \mu P_1(t) - 2(\lambda_s + \lambda_h) P_2(t) \\
P'_1(t) &= 2(\lambda_s + \lambda_h) P_2(t) + \mu P_0(t) - (\lambda_s + \lambda_h + \mu) P_1(t) \\
P'_0(t) &= (\lambda_s + \lambda_h) P_1(t) - \mu P_0(t)
\end{aligned}
$$

with the initial condition $P_2(0) = 1$, $P_1(0) = P_0(0) = 0$. If the repair rate 
is $\mu = 0.01$ for both software failure and hardware failure, we can obtain the 
availability function $A(t) = 1 - P_0(t)$ numerically, as shown in Fig. 5.11. 




Fig. 5.11. Availability function considering identical 
software/hardware repair rate. 
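The availability curve for the identical-repair-rate case can be reproduced approximately with a short numerical integration. The sketch below is one possible way to do it in plain Python, assuming a fixed-step Runge-Kutta scheme; the step size, time horizon and function names are illustrative choices. It also computes the limiting availability from the balance equations of the three-state chain, which the transient solution should approach for large $t$.

```python
def derivs(p, lam, mu):
    """Chapman-Kolmogorov right-hand side for the 2+0 cluster; p = [P2, P1, P0]."""
    p2, p1, p0 = p
    return [mu * p1 - 2.0 * lam * p2,
            2.0 * lam * p2 + mu * p0 - (lam + mu) * p1,
            lam * p1 - mu * p0]


def rk4(p, lam, mu, t_end, steps):
    h = t_end / steps
    for _ in range(steps):
        k1 = derivs(p, lam, mu)
        k2 = derivs([x + 0.5 * h * d for x, d in zip(p, k1)], lam, mu)
        k3 = derivs([x + 0.5 * h * d for x, d in zip(p, k2)], lam, mu)
        k4 = derivs([x + h * d for x, d in zip(p, k3)], lam, mu)
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


def availability(t, lam, mu, steps=4000):
    """A(t) = 1 - P0(t), starting with both processors good."""
    return 1.0 - rk4([1.0, 0.0, 0.0], lam, mu, t, steps)[2]


def steady_state_availability(lam, mu):
    """Limit of A(t): balance equations give P1 = (2 lam / mu) P2 and P0 = (lam / mu) P1."""
    ratio1 = 2.0 * lam / mu
    ratio0 = ratio1 * lam / mu
    return 1.0 - ratio0 / (1.0 + ratio1 + ratio0)


lam = 0.003 + 0.0012   # lambda_s + lambda_h from Example 5.6
mu = 0.01              # identical repair rate from Example 5.6
a_100 = availability(100.0, lam, mu)
a_late = availability(2000.0, lam, mu)
a_inf = steady_state_availability(lam, mu)
print(a_100, a_late, a_inf)
```

With these rates the limiting availability is about 0.839, so a plotted curve should decrease from 1 toward that level.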

3) General case of different repair rates 

Considering different repair rates for software and hardware, the Markov model 
for this 2+0 cluster is constructed in Fig. 5.12. 
State (0,0): initial state, all components working 
State (1,0): 1 hardware down, 1 host working 
State (2,0): 2 hardware down, system down 
State (1,1): 1 hardware down, 1 software down, system down 
State (0,1): 1 software down, 1 host working 
State (0,2): 2 software down, system down 

Fig. 5.12. CTMC for Example 5.6 with different software and hardware 
repair rates. 



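The corresponding Chapman-Kolmogorov differential equations are 

$$
\begin{aligned}
P'_{0,0}(t) &= \mu_h P_{1,0}(t) + \mu_s P_{0,1}(t) - (2\lambda_h + 2\lambda_s) P_{0,0}(t) \\
P'_{1,0}(t) &= 2\lambda_h P_{0,0}(t) + \mu_h P_{2,0}(t) + \mu_s P_{1,1}(t) - (\mu_h + \lambda_h + \lambda_s) P_{1,0}(t) \\
P'_{2,0}(t) &= \lambda_h P_{1,0}(t) - \mu_h P_{2,0}(t) \\
P'_{1,1}(t) &= \lambda_s P_{1,0}(t) + \lambda_h P_{0,1}(t) - (\mu_h + \mu_s) P_{1,1}(t) \\
P'_{0,1}(t) &= 2\lambda_s P_{0,0}(t) + \mu_h P_{1,1}(t) + \mu_s P_{0,2}(t) - (\mu_s + \lambda_h + \lambda_s) P_{0,1}(t) \\
P'_{0,2}(t) &= \lambda_s P_{0,1}(t) - \mu_s P_{0,2}(t)
\end{aligned}
\tag{5.41}
$$

with initial condition $P_{0,0}(0) = 1$ and all other probabilities 0. If the 
repair rates are $\mu_s = 0.008$ and $\mu_h = 0.012$, the availability function 
$A(t) = 1 - [P_{2,0}(t) + P_{1,1}(t) + P_{0,2}(t)]$ can be computed numerically. 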
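For the case of different repair rates, it can be convenient to describe the CTMC as a table of transitions and let a generic routine assemble the differential equations. The following is a minimal sketch under that assumption, using only the Python standard library; the transition table encodes the six states of the 2+0 cluster as read from the state descriptions, and the integrator, step size and names are illustrative.

```python
def build_transitions(lam_s, lam_h, mu_s, mu_h):
    """Transition rates between states (i, j): i hardware down, j software down."""
    return {
        ((0, 0), (1, 0)): 2.0 * lam_h,
        ((0, 0), (0, 1)): 2.0 * lam_s,
        ((1, 0), (2, 0)): lam_h,
        ((1, 0), (1, 1)): lam_s,
        ((1, 0), (0, 0)): mu_h,
        ((0, 1), (1, 1)): lam_h,
        ((0, 1), (0, 2)): lam_s,
        ((0, 1), (0, 0)): mu_s,
        ((2, 0), (1, 0)): mu_h,
        ((1, 1), (0, 1)): mu_h,
        ((1, 1), (1, 0)): mu_s,
        ((0, 2), (0, 1)): mu_s,
    }


def derivative(p, transitions):
    """Probability flow balance: each transition moves rate * p[src] from src to dst."""
    dp = {s: 0.0 for s in p}
    for (src, dst), rate in transitions.items():
        flow = rate * p[src]
        dp[src] -= flow
        dp[dst] += flow
    return dp


def rk4_step(p, transitions, h):
    def shifted(d, scale):
        return {s: p[s] + scale * d[s] for s in p}
    k1 = derivative(p, transitions)
    k2 = derivative(shifted(k1, 0.5 * h), transitions)
    k3 = derivative(shifted(k2, 0.5 * h), transitions)
    k4 = derivative(shifted(k3, h), transitions)
    return {s: p[s] + h / 6.0 * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s])
            for s in p}


def solve(transitions, t_end, steps):
    states = [(0, 0), (1, 0), (2, 0), (1, 1), (0, 1), (0, 2)]
    p = {s: 0.0 for s in states}
    p[(0, 0)] = 1.0
    h = t_end / steps
    for _ in range(steps):
        p = rk4_step(p, transitions, h)
    return p


def availability(p):
    """At least one processor working: states with fewer than two processors down."""
    return p[(0, 0)] + p[(1, 0)] + p[(0, 1)]


trans = build_transitions(lam_s=0.003, lam_h=0.0012, mu_s=0.008, mu_h=0.012)
p_200 = solve(trans, 200.0, 2000)
p_late = solve(trans, 5000.0, 5000)
print(availability(p_200), availability(p_late))
```

Keeping the model as a transition table makes it straightforward to extend the same routine to larger $n + k$ configurations by generating the entries from $\lambda_s(i, j)$ and $\lambda_h(i, j)$.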
5.4. A Unified NHPP Markov Model 

In order to incorporate the NHPP software reliability model into the Markov 
hardware reliability model, Welke et al. (1995) developed a unified NHPP 
Markov model. The unified model is accomplished by determining a transition 
probability for a software failure and then incorporating the software failure 
transitions into the hardware reliability model. Based on this unified model, the 
differential equations can be easily established and solved despite the 
time-varying software failure rates. 

The basic assumptions of this unified model are listed below: 

1) Software failures are described by a general NHPP model, with the 
probability function 

$$P(n, t) = \Pr\{N(t) = n\} = \frac{[m(t)]^n}{n!} \exp\{-m(t)\}, \quad n = 0, 1, 2, \dots \tag{5.42}$$

where $m(t)$ is the mean value function and $n$ is the number of failures 
occurring up to time $t$. 

2) The times between hardware failures are exponentially distributed random 
variables. 
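The NHPP probability function in assumption 1) is straightforward to evaluate once a mean value function is chosen. As a small illustration, the sketch below assumes the Goel-Okumoto form $m(t) = a(1 - e^{-bt})$; both this choice and the parameter values are assumptions made here for demonstration only.

```python
import math


def go_mean_value(t, a, b):
    """Assumed Goel-Okumoto mean value function m(t) = a * (1 - exp(-b t))."""
    return a * (1.0 - math.exp(-b * t))


def nhpp_pmf(n, t, a, b):
    """P(n, t) = m(t)^n / n! * exp(-m(t)), the probability of n failures by time t."""
    m = go_mean_value(t, a, b)
    return m ** n / math.factorial(n) * math.exp(-m)


# Hypothetical parameters: a = 10 expected faults in total, b = 0.05 per unit time.
a, b, t = 10.0, 0.05, 20.0
probs = [nhpp_pmf(n, t, a, b) for n in range(60)]
total = sum(probs)
mean = sum(n * q for n, q in enumerate(probs))
print(total, mean, go_mean_value(t, a, b))
```

The probabilities sum to one and their mean equals $m(t)$, which is the property used later when the sum over all failure orders is collapsed.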



5.4.1. Software failure transition probability 

The mathematical justification for implementing the NHPP model as a Markov 
process is based on the concept of $r$th order inter-arrival times (Drake, 1967). 
Denote by $l_r$ the random variable of the $r$th order inter-arrival time and let 
$f_{l_r}(l)$ be the probability density function of $l_r$. We then have 

$$\Pr\{l < l_r \le l + \Delta l\} = \int_{l}^{l + \Delta l} f_{l_r}(\xi)\, d\xi \approx f_{l_r}(l)\, \Delta l \tag{5.43}$$

This equation provides a discrete-time relationship that can be incorporated into 
the discrete-time Markov model. The time between failures in a Poisson model 
has an exponential distribution, so the same derivation used in hardware 
modeling can be used here to show: 

$$\Pr\{\text{failure occurs in } (t, t + \Delta t]\} = \lambda \Delta t \tag{5.44}$$

Given this approximation, the discrete-time relationship can be written in a 
slightly different form as: 

$$f_{l_r}(l)\, \Delta l = P(r - 1, l)\, \lambda\, \Delta t \tag{5.45}$$

where $P(r - 1, l)$ is the probability that there are exactly $r - 1$ failures in 
an interval of duration $l$. 

Note that the failure intensity of an NHPP is time-varying (Welke et al., 1995), 
so Eq. (5.45) becomes: 

$$f_{l_r}(l)\, \Delta l = P(r - 1, l)\, \lambda(t)\, \Delta t \tag{5.46}$$

Substituting Eq. (5.42) into the above equation, we have 

$$f_{l_r}(l)\, \Delta l = \frac{[m(l)]^{r-1}}{(r-1)!} \exp\{-m(l)\}\, \lambda(t)\, \Delta t \tag{5.47}$$



5.4.2. Markov modeling 

We now use the above equation to describe software state transitions in a Markov 
model. The transition we evaluate is the probability that the software remains in 
the same (operational) state, given that it started in that state. Since Eq. (5.43) 
gives the probability that failure $r$ occurs in $[l, l + \Delta l]$, the 
probability that any software failure occurs in this interval is simply the sum of 
Eq. (5.43) over all possible values of $r$. Assume that the maximum value of $r$ 
is large enough to approximate this sum as 

$$\sum_{r=1}^{\infty} f_{l_r}(l)\, \Delta l \tag{5.48}$$

Therefore, the probability that no failures occur in $[l, l + \Delta l]$ is 

$$f_{l_0}(l)\, \Delta l = 1 - \sum_{r=1}^{\infty} f_{l_r}(l)\, \Delta l \tag{5.49}$$

By substituting Eq. (5.47) into the above equation and performing some algebraic 
manipulation, we have 

$$
f_{l_0}(l)\, \Delta l = 1 - \lambda(t)\, \Delta t \sum_{r=1}^{\infty} \frac{[m(l)]^{r-1}}{(r-1)!} \exp\{-m(l)\}
= 1 - \lambda(t)\, \Delta t
\tag{5.50}
$$

since $\Delta l = \Delta t$ and the Poisson probabilities in the sum add up to 
one. The above equation gives the probability that no software failure occurs 
during a short time $\Delta t$, so the transition probability from the operational 
state to the software failure state during a short enough $\Delta t$ can be 
expressed as 

$$\Pr\{\text{software failure occurs in } (t, t + \Delta t]\} = \lambda(t)\, \Delta t \tag{5.51}$$

Hence, with the above transition probability, the NHPP model can be integrated 
into the Markov model. For details, see Welke et al. (1995). Based on the above 
equations, the differential equations can be obtained and solved as usual. An 
example is given below. 



Example 5.7. Suppose that a processing element contains both software and 
hardware parts. The software failures follow a classical NHPP model, the 
GO-model (Goel & Okumoto, 1979), with failure intensity function 

$$\lambda_s(t) = ab \exp(-bt)$$

and the hardware failures follow the exponential distribution with parameter 
$\lambda_h$. Suppose $a = 0.001$, $b = 10$, $\lambda_h = 0.0005$, and the failed 
system is repaired with repair rate $\mu_s = 0.01$ for software and 
$\mu_h = 0.02$ for hardware. 

Let state 0 be the operational state, state 1 the software failure state and 
state 2 the hardware failure state. The state probabilities satisfy the following 
differential equations 

$$
\begin{aligned}
P'_0(t) &= P_2(t) \mu_h + P_1(t) \mu_s - P_0(t) (\lambda_s(t) + \lambda_h) \\
P'_1(t) &= P_0(t) \lambda_s(t) - P_1(t) \mu_s \\
P'_2(t) &= P_0(t) \lambda_h - P_2(t) \mu_h
\end{aligned}
$$

With the initial condition $P_0(0) = 1$, $P_1(0) = P_2(0) = 0$, we can get the 
availability function $A(t) = P_0(t)$ numerically, as shown in Fig. 5.13. 




Fig. 5.13. Availability function for Example 5.7. 
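The availability function of Example 5.7 involves a time-varying software failure rate, so the integrator has to evaluate the rates at each intermediate time. The sketch below shows one way to do this with a fixed-step Runge-Kutta scheme in plain Python; the step size and function names are assumptions, and the parameter values are those quoted in the example. Because $\lambda_s(t)$ vanishes for large $t$, the availability should approach the hardware-only limit $\mu_h / (\lambda_h + \mu_h)$.

```python
import math


def derivs(t, p, a, b, lam_h, mu_s, mu_h):
    """p = [P0, P1, P2]: operational, software failed, hardware failed."""
    p0, p1, p2 = p
    lam_s = a * b * math.exp(-b * t)   # GO-model failure intensity at time t
    return [p2 * mu_h + p1 * mu_s - p0 * (lam_s + lam_h),
            p0 * lam_s - p1 * mu_s,
            p0 * lam_h - p2 * mu_h]


def solve(t_end, h, a, b, lam_h, mu_s, mu_h):
    p = [1.0, 0.0, 0.0]
    args = (a, b, lam_h, mu_s, mu_h)
    steps = int(round(t_end / h))
    for n in range(steps):
        t = n * h
        k1 = derivs(t, p, *args)
        k2 = derivs(t + 0.5 * h, [x + 0.5 * h * d for x, d in zip(p, k1)], *args)
        k3 = derivs(t + 0.5 * h, [x + 0.5 * h * d for x, d in zip(p, k2)], *args)
        k4 = derivs(t + h, [x + h * d for x, d in zip(p, k3)], *args)
        p = [x + h / 6.0 * (c1 + 2 * c2 + 2 * c3 + c4)
             for x, c1, c2, c3, c4 in zip(p, k1, k2, k3, k4)]
    return p


params = dict(a=0.001, b=10.0, lam_h=0.0005, mu_s=0.01, mu_h=0.02)
p_50 = solve(50.0, 0.02, **params)
p_500 = solve(500.0, 0.02, **params)
limit = params["mu_h"] / (params["lam_h"] + params["mu_h"])
print(p_50[0], p_500[0], limit)
```

The step size is kept well below $1/b$ so that the rapidly decaying intensity near $t = 0$ is resolved.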









Note that although this example implemented the GO model for software failures, 
other NHPP software models, see e.g. Xie (1991), can also be integrated into the 
unified model according to their specific conditions. 



5.5. Notes and References 

Pukite & Pukite (1998) summarized some simple models for the reliability 
analysis of the hardware and software system. Another useful reference is Kapur 
et al. (1998). 

Similar to the single-processor model presented in this chapter, Hecht & 
Hecht (1986) also studied the reliability in the system context considering both 
software and hardware. Fryer (1985) implemented the fault tree analysis in 
analyzing the reliability of combined software/hardware systems, which 
determines how component failures can contribute to system failure. Sumita & 
Masuda (1986) developed a combined hardware/software reliability model where 
both lifetimes and repair times of software and hardware subsystems are 
considered together. Kim & Welch (1989) examined the concept of distributed 
execution of recovery blocks as an approach for uniform treatment of hardware 
and software faults. Keene & Lane (1992) reviewed the similarities and 
differences between hardware, software and system reliability. Kanoun & 
Ortalo-Borrel (2000) explicitly modelled the case of hardware and software 
component-interactions. 

For the clustered systems, Laprie & Kanoun (1992) presented Markov 
models for analyzing the system availability. Later, Dugan & Lyu (1995) 
discussed the modeling and analysis of three major architectures of the clustered 
system containing multiple versions of software/hardware, and they combined 
fault tree analysis techniques and Markov modeling techniques to incorporate 
transient and permanent hardware faults as well as unrelated and related software 
faults. 
Recently, Pasquini et al. (2001) considered the reliability for systems based 
on software and human resources. Choi & Seong (2001) studied a system 
considering software masking effects on hardware faults. Zhang & Horigome 
(2001) discussed the availability and reliability on the system level considering 
the time-varying failures that are dependent among the software/hardware 
components. Lai et al. (2002) studied the reliability of the distributed 
software/hardware systems, where Markov models were implemented by 
assuming that the software failure rate is decreasing while the hardware has a 
constant failure rate. Dai et al. (2003a) further studied the reliability and 
availability of distributed services which combined both software program 
failures and hardware network failures altogether. 
AVAILABILITY AND 
RELIABILITY OF DISTRIBUTED 
COMPUTING SYSTEMS 



Distributed computing systems are a widely used type of computing system. The 
performance of a distributed computing system is determined not only by 
software or hardware reliability but also by the reliability of the networks used 
for communication. This chapter presents some results on the availability and 
reliability of distributed computing systems by considering the failures of 
software programs, hardware processors and network communication. Graph 
theory and Markov models are the main tools used. 

The chapter is divided into four parts. First, the general distributed computing 
system and some specific commonly used systems are introduced. Second, 
distributed program/system reliability is analyzed and some analytical tools for 
evaluating it are demonstrated. Third, the homogeneous distributed 
software/hardware system is studied: the system availability is analyzed by 
Markov models and the imperfect debugging process is further introduced. 
Finally, the Centralized Heterogeneous Distributed System is studied and 
approaches to its service reliability are shown. 
6.1. Introduction to Distributed Computing 

A distributed computing system is designed to complete certain computing 
tasks in a networked environment, see e.g. Casavant & Singhal (1994) and Loy et 
al. (2001). Such systems have gained in popularity in recent years owing to the 
availability of low-cost processors. A common distributed system is made up of 
several hosts connected by a network, where computing functions are shared among 
the hosts, as depicted in Fig. 6.1. 




Fig. 6.1. General distributed computing system. 



A typical application in distributed systems is distributed software, of which 
identical copies run on all the distributed hosts. A homogeneously distributed 
system is a system in which all of the distributed hosts are of the same type, such 
as workstations from the same vendor. Identical copies of distributed software 
running on a homogeneously distributed system form what is called a 
homogeneously distributed software/hardware system (Lai et al., 2002). 

For example, a search engine system provides the service of searching for 
related information. To receive and serve millions of search requests 
every day, the search engine system should contain many servers of the same type 
running identical software to explore the database. Such a system is a 
homogeneous distributed software/hardware system. Other examples of this kind of 
system can be found in communication protocols, telephone switching systems, web 
services, and distributed database management systems. 

Besides the homogeneous distributed systems, most other distributed 
systems can be classified as centralized heterogeneous distributed systems (Dai et 
al., 2003a). This kind of system consists of many heterogeneous subsystems 
managed by a control center. 

For example, in modern warfare, each soldier can be considered as an 
element in a military system, equipped with different electronic devices 
for diverse purposes. The information collected from each soldier is sent back to 
a control center through wireless communication channels. Then, the control 
center can analyze all the information and send out commands to the respective 
soldiers. The functions of different groups of soldiers are diversified in a war 
(such as attacking, defending, supplying, and rescuing), so their equipment 
is also heterogeneous. Thus, it is a typical Centralized 
Heterogeneous Distributed System, as depicted in Fig. 6.2. 




Fig. 6.2. A military system. 
The reliability of a distributed system is a key aspect of its QoS (Quality of 
Service). However, reliability analysis of such systems is complicated by their 
various topologies, integrated software and hardware, or highly heterogeneous 
subsystems. This chapter studies these issues and presents models and analytical 
tools that can be easily implemented to estimate the reliability and availability of 
those distributed systems. 



6.2. Distributed Program and System Reliability 

A general distributed computing system consists of processing elements (nodes), 
communication channels (links), memory units, data files, and programs. These 
resources are interconnected via a network that indicates how information flows 
among them. Programs residing on some nodes can use/load data files from other 
nodes. Hence, the program/system reliability in the general networked 
environment is worth studying in order to comprehensively qualify the 
distributed system. 



6.2.1. Architecture and reliability model of distributed systems 
General architecture of distributed computing systems 

A typical distributed system can be viewed as a two-level hierarchical structure 
(Pierre & Hoang, 1990). The first level consists of the communication 
sub-network, also called the backbone. It comprises linked switching nodes, and 
its main function is the end-to-end transfer of information. The second level 
consists of nodes/terminals, such as processors, programs, files, resources and so 
on. 

In general, $n$-processor distributed systems can be depicted as in Fig. 6.3. Each 
node can execute a set of programs $PN_i$ and share a set of data files $FN_i$ 
($i = 1, 2, \dots, n$). Programs residing on some nodes can be run using data 
files at other nodes. 




Fig. 6.3. n processors of a distributed computing system. 



Reliability model 

Based on the above model for the general distributed computing systems, the 
definition of the distributed program reliability is given below: 



Definition 6.1. Distributed program reliability in a distributed computing system 
is the probability of successful execution of a program that runs on multiple 
processing elements and needs to retrieve data files from other processing 
elements. 



From the definition, the distributed program reliability varies according to 



1) the network topology of the distributed computing system 
2) the reliability of the communication links 

3) the reliability of the processing nodes 

4) the data files and programs distribution among processing elements 

5) the data files required to execute a program. 



Example 6.1. Consider the distributed computing system shown in Fig. 6.4. 




Fig. 6.4. A distributed computing system with four nodes and five links. 



This distributed computing system consists of four processing nodes 
(N1, N2, N3, N4) that run three different programs (P1, P2, P3) distributed in a 
redundant manner among the processing elements. Four data files (F1, F2, F3, F4) 
are also distributed in a redundant manner. $FP_i$ ($i = 1, 2, 3$) is the set of 
files that are required by the program $P_i$. 

In Fig. 6.4, program $P_1$ can run successfully when either N1 or N4 is 
working and it is possible to access the data files (F1, F2, F3). If $P_1$ is 
running on N1, which holds the files F1 and F2, it also needs access to the file 
F3, which is resident at N2 or N4. That is, additional nodes and links are needed 
to access that required file (F3). Thus, the distributed program reliability 
depends on the reliability of all the processing nodes and communication links 
involved. 



The distributed program reliability measures the reliability of one program in 
the system. However, for reliability of the distributed computing systems, it is 
important to obtain a global reliability measure that describes how reliable the 
system is for a given distribution of programs and files (Hariri & Mutlu, 1995). 
The definition of distributed system reliability is given below. 



Definition 6.2. Distributed system reliability is the probability that all the 
distributed programs are executed successfully under the distributed computing 
environment. 



For the distributed computing system depicted in Fig. 6.4, all three programs
(P1,P2,P3) are required to execute successfully, and the four data files (F1,F2,F3,F4)
are needed to run them. Thus, the distributed system reliability
here is the probability that all three programs are successfully executed
while accessing all the data files they require.

In order to estimate the distributed program/system reliability, some 
assumptions of the reliability model for the distributed computing system, see e.g. 
Kumar et al. (1986), are given below: 

Assumptions: 

1) Each node or link in the distributed computing system has two states: 
operational or faulty. 

2) If a link is faulty, information cannot be transferred through it. 

3) If a node is faulty, the program contained in the node cannot be
successfully executed, the files saved in it cannot be accessed by other
nodes, and information cannot be transferred through it.

4) The probability for a processing node ($N_i$) to be operational is constant,
which is denoted by $p_i$, and $q_i = 1 - p_i$.

5) The probability for a communication link ($L_{ij}$) to be operational is also
constant, which is denoted by $p_{ij}$, and $q_{ij} = 1 - p_{ij}$.

6) Failures of all the nodes and links are statistically independent of each
other.

It is indicated in Lin & Chen (1997) that computing distributed reliability is an
NP-hard problem (Valiant, 1979) even when the distributed computing system is
restricted to simple structures such as a series-parallel graph, a tree or a star. Hence,
general and effective analytical tools are required to evaluate its reliability.



6.2.2. Kumar’s analytical tool 

This analytical tool, presented by Kumar et al. (1986), is based on the
Minimal File Spanning Tree (MFST). In general, the set of nodes and links
involved in running the given program and accessing its required files forms a tree.
Such a tree is called a File Spanning Tree (FST), defined below.



Definition 6.3. File Spanning Tree is a spanning tree that connects the root node 
(the processing element that runs the program under consideration) to some other 
nodes such that its vertices hold all the required files for executing that program. 



The smallest dominating file spanning tree is called an MFST, and its definition is
given below.

Definition 6.4. A Minimal File Spanning Tree, denoted by $MFST_i$, is an FST
such that there exists no other file spanning tree, say $FST_j$, which is a subset of
$MFST_i$.



An example of the FSTs and MFSTs is illustrated below. 



Example 6.2. Continue considering the distributed system of Fig. 6.4. The
following are some FSTs that make P1 run successfully on N1: {N1, N2, L12};
{N1, N2, N3, L12, L23}; {N1, N2, N4, L12, L24}; {N1, N2, N3, L13, L23};
{N1, N3, N4, L13, L34}; {N1, N2, N3, N4, L12, L23, L34};
{N1, N2, N3, N4, L12, L24, L34}; {N1, N2, N3, N4, L13, L23, L24}.
Likewise, there will be several other FSTs when
program P1 runs on N4.



The file spanning tree {N1, N2, N4, L12, L24} is not minimal because its
subset {N1, N2, L12} is also an FST. We are interested in finding all the MFSTs
to run a distributed program. For P1 to run on either N1 or N4, there are four MFSTs.
They are

{N1, N2, L12}, {N1, N2, N3, L13, L23}, {N3, N4, L34}, {N2, N3, N4, L23, L24}.

Any one of these four MFSTs can provide a successful execution of the
program under consideration when all its elements are working.
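
The minimality test of Definition 6.4 can be sketched in a few lines of Python. This is only an illustration: the element sets are our reading of the FSTs in Example 6.2 (for P1 rooted at N1 or N4), and the function name `minimal_fsts` is hypothetical.

```python
def minimal_fsts(fsts):
    """Keep only the FSTs that contain no other FST as a proper subset."""
    sets = [frozenset(f) for f in fsts]
    return [f for f in sets if not any(g < f for g in sets)]


# Some FSTs for program P1 in Fig. 6.4 (assumed reading of Example 6.2).
fsts = [
    {"N1", "N2", "L12"},
    {"N1", "N2", "N3", "L12", "L23"},
    {"N1", "N2", "N4", "L12", "L24"},
    {"N1", "N2", "N3", "L13", "L23"},
    {"N1", "N3", "N4", "L13", "L34"},
    {"N3", "N4", "L34"},
    {"N2", "N3", "N4", "L23", "L24"},
]

mfsts = minimal_fsts(fsts)
for tree in mfsts:
    print(sorted(tree))
```

Under these assumptions, the three trees that contain {N1, N2, L12} or {N3, N4, L34} as a proper subset are discarded, and the four remaining trees are the MFSTs.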



From the above example, it can be seen that the distributed program can run
successfully if any one of the MFSTs is operational. Hence, the distributed
program reliability can be generally described as the probability of having
at least one of the MFSTs operating:

DPR = Pr(at least one MFST of a given program is operational)

This can be written as

$$DPR = \Pr\left(\bigcup_{j=1}^{n_{mfst}} MFST_j\right) \tag{6.1}$$

where $n_{mfst}$ is the total number of MFSTs that run the given program. The
evaluation of the reliability of executing a program on a distributed system can be
carried out in the following two stages.



Stage 1. Find all the Minimal File Spanning Trees: 

The purpose of this stage is to search for all the MFSTs whose roots are the
processing elements that run a program, say $P_i$. The minimal file spanning trees
are generated in nondecreasing order of their sizes, where the size is defined as
the number of links in an MFST. At first, all MFSTs of size 0 are determined; this
occurs when there exist some processing nodes that run $P_i$ and have all the
needed files (denoted by the set $FP_i$) for its execution. Then, all MFSTs
of size 1 are determined; these trees have only one link, which connects the root
node to some other node, such that the root node and the other node have all the
files in $FP_i$. This procedure is repeated for all possible sizes of MFSTs up to $n-1$,
where $n$ is the total number of nodes in the system. The detailed description of
the algorithm to search for all the MFSTs is given by Kumar et al. (1986).

Stage 2. Apply a terminal reliability algorithm to evaluate distributed 
program reliability: 

Here we find the probability that at least one MFST is working, which means that
all the edges and vertices included in it are operational. Any terminal reliability
evaluation algorithm based on path or cutset enumeration can be used to obtain
the distributed program reliability of the program under consideration.

The distributed system reliability can be written as the probability of the
intersection of the sets of MFSTs of each program, which is

$$DSR = \Pr\left(\bigcap_{m} MFST(P_m)\right) \tag{6.2}$$

where $MFST(P_m)$ denotes the set of all the MFSTs associated with the program
$P_m$.
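
Stage 2 amounts to evaluating Eq. (6.1), the probability of a union of events. One simple way to do this, which is not necessarily the terminal reliability algorithm used by Kumar et al. (1986), is inclusion-exclusion over the MFSTs; it relies on the independence assumption 6). In the sketch below the function name, the dictionary layout and the element names are illustrative assumptions, and the MFSTs are our reading of Example 6.2.

```python
from itertools import combinations
from math import prod


def dpr_inclusion_exclusion(mfsts, rel):
    """Pr(at least one MFST has all of its nodes and links operational)."""
    total = 0.0
    for r in range(1, len(mfsts) + 1):
        sign = 1.0 if r % 2 else -1.0
        for group in combinations(mfsts, r):
            # All trees in the group work iff every element of their union works.
            elements = frozenset().union(*group)
            total += sign * prod(rel[e] for e in elements)
    return total


# MFSTs of program P1 in Fig. 6.4 (assumed reading of Example 6.2).
mfsts = [frozenset(m) for m in (
    {"N1", "N2", "L12"},
    {"N1", "N2", "N3", "L13", "L23"},
    {"N3", "N4", "L34"},
    {"N2", "N3", "N4", "L23", "L24"},
)]

# Perfect nodes and links of reliability 0.9 (illustrative values).
rel = {name: 1.0 for name in ("N1", "N2", "N3", "N4")}
rel.update({name: 0.9 for name in ("L12", "L13", "L23", "L24", "L34")})

dpr = dpr_inclusion_exclusion(mfsts, rel)
print(round(dpr, 5))
```

Inclusion-exclusion needs one term per non-empty subset of MFSTs, so it is only practical when the number of MFSTs is small; node reliabilities below 1 can be supplied through the same dictionary.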



Example 6.3. An example using the above analytical tool to estimate distributed 
program/system reliability is illustrated below. 




Fig. 6.5. A six-node distributed computing system. 



The distributed computing system shown in Fig. 6.5 consists of six processing 
elements that can run four distributed programs and save six data files. The files 
needed for executing these programs are indicated in the following sets: 

FP1 = {F1,F2,F3}, FP2 = {F2,F4,F6},
FP3 = {F1,F3,F5}, FP4 = {F1,F2,F4,F6}



Evaluating the distributed program reliability 

We first derive the reliability of program P1, denoted by DPR(P1). The program
P1 can run on either N1 or N6. Its MFSTs can be found by stage 1, as depicted
in Fig. 6.6. The double-line circles represent the root nodes, the single-line circles
represent the other vertices of a tree, the number in a circle is the node number in the
distributed computing system, and the files marked beside a circle are the
new data files reached at that node.




Fig. 6.6. MFSTs for DPR(P1)

Based on the generated MFSTs in Fig. 6.6, using the terminal reliability algorithm
of stage 2, we can obtain the reliability of program P1. If we assume that all the
elements (processing nodes and communication links) of the distributed
computing system shown in Fig. 6.5 have the same reliability, equal to 0.9,
then the reliability of executing P1 is computed as 0.9378.



Evaluating the distributed system reliability 

The first step in evaluating the distributed system reliability is to determine all the 
MFSTs for each program that can run on the system. The next step of this 
algorithm is to determine the set of all MFSTs that guarantee successful execution 
of all the programs by recursively intersecting the MFSTs of each program. 

The final step is to apply a terminal reliability algorithm to obtain the
distributed system reliability. If we again assume that all
the nodes and links have the same reliability, say 0.9, then the reliability of the
distributed computing system is 0.842674.



6.2.3. A family of FST analytical tools 

For analyzing the distributed program/system reliability, another family of File 
Spanning Tree (FST) analytical tools is further developed, which shows good 
efficiency for some specific problems. 

The first FST analytical tool in this family was presented by Chen & Huang
(1992) without considering node failures (i.e. it considers only the communication
failures of the network links). Hence, this analytical tool is only suitable for
distributed computing systems whose processing elements are perfect or highly
reliable, so that the probability for them to fail when working is negligible.

The main difference between this tool and Kumar’s tool above is that Kumar’s
MFST starts from one root node and expands the trees, whereas Chen-Huang’s
FST starts from the whole system graph and cuts links to prune the trees.
Moreover, Kumar’s tool requires an additional terminal reliability
algorithm to derive the distributed program/system reliability, whereas this FST tool
obtains the solution directly while pruning the trees.



The basic concept of the FST analytical tool 

The basic idea of the FST analytical tool is to find all disjoint FSTs of each size,
starting from the original graph representing the distributed computing system. If
we use an efficient method to cut one link at a time from the graph at a different
place to generate possible subgraphs recursively, then we are able to decide whether
each of the resulting subgraphs is an FST by checking the set of data files
contained in the subgraph against the set of data files required for executing the
distributed programs. This process can be repeated for graph sizes $K$, $K-1$, ..., 0
(where $K$ is the number of links in the graph). Obviously, without an
efficient method to remove appropriate links, the efficiency of the analytical tool
could be very poor.

The method for cutting the graph plays an important role in finding the FSTs 
and in computing the reliability of the distributed computing system. The brief 
introduction for this method is given by the following five steps: 

Step 1. Find a spanning tree from the current graph if necessary and compute
$ST_G$ ($ST_G$: a set of link states that can be used to construct the spanning
tree of subgraph $G$), where each link $L_{ij}$ has three states: 1) faulty state,
denoted by 0; 2) operational state, denoted by 1; 3) undetermined state,
denoted by *.

Step 2. Compute the vector $LSP_G$ by $ST_G \wedge LS_G$, and convert the vector $LSP_G$
to the probability expression ($LSP_G$: a set of link states that can be used to
compute the probability of subgraph $G$. The state condition could be
either 1, 0, or * as above; $LS_G$: a set of link states that represents the links’
conditions in the current subgraph $G$).

Step 3. Cut the current graph according to the vectors $ST_G$ and $NC_G$ to
obtain its subgraphs (or FSTs), where $NC_G$ denotes a set of link states
that indicate which links cannot be cut in subgraph $G$.

Step 4. Repeat steps 1 to 3 to compute each subgraph’s vector $LSP_G$.

Step 5. The reliability of the distributed computing system graph is obtained
by uniting all the $LSP_G$ vectors that are associated with the FSTs.

The above steps give the concept of finding all FSTs and computing the reliability of the
distributed computing system; the detailed algorithm for finding the
FSTs and computing the reliability is given in Chen & Huang
(1992). An example for the FST analytical tool is illustrated below.
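
The idea of fixing link states one at a time to obtain disjoint terms can be illustrated by a plain recursive conditioning (factoring) scheme. This is a simplified stand-in and not the cutting procedure of Chen & Huang (1992): it uses neither the spanning-tree vector $ST_G$ nor the no-cut vector $NC_G$. The link names and the link sets of the MFSTs are our reading of Fig. 6.4 for program P1 with perfect nodes, and the function name is hypothetical.

```python
from math import prod

LINKS = ["L12", "L13", "L23", "L24", "L34"]
# Link parts of the four MFSTs of P1 (assumed reading of Example 6.2).
MFSTS = [{"L12"}, {"L13", "L23"}, {"L34"}, {"L23", "L24"}]


def disjoint_terms(p, state=None):
    """Return (link-state vector, probability) pairs of mutually disjoint terms.

    Each link state is 1 (operational), 0 (faulty) or '*' (undetermined).
    """
    state = state or {link: "*" for link in LINKS}
    up = {l for l, s in state.items() if s == 1}
    not_down = {l for l, s in state.items() if s != 0}
    if any(m <= up for m in MFSTS):
        # Success is guaranteed whatever the undetermined links do.
        prob = prod(p[l] if s == 1 else 1 - p[l]
                    for l, s in state.items() if s != "*")
        return [(dict(state), prob)]
    if not any(m <= not_down for m in MFSTS):
        return []  # no MFST can be completed any more
    link = next(l for l in LINKS if state[l] == "*")
    terms = []
    for value in (1, 0):  # condition on the link being up, then down
        terms += disjoint_terms(p, {**state, link: value})
    return terms


terms = disjoint_terms({link: 0.9 for link in LINKS})
dpr = sum(prob for _, prob in terms)
print(len(terms), round(dpr, 5))
```

Because every branch fixes a link to opposite states, the returned terms are disjoint and their probabilities can simply be added. The number of terms depends on the order in which links are examined, which is where an efficient cutting rule matters.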



Example 6.4. Consider the simple distributed computing system in Fig. 6.4 again.
We use the FST reliability analytical tool to analyze the distributed
program/system reliability. For the program P1, its reliability DPR(P1) is
evaluated from the splitting snapshots of the subgraphs generated by the above FST tool.

To compute the reliability, sum all the disjoint terms represented by the
vectors $LSP_G$, which gives

$$DPR(P_1) = p_{12} + q_{12}p_{13}p_{23} + q_{12}q_{23}p_{34} + q_{12}q_{13}p_{23}q_{24}p_{34} + q_{12}q_{13}p_{23}p_{24}$$

Similarly, the distributed system reliability can be obtained from the above FST 
tool as 

DSR = p }2 Pj3 + + PnPnQitPuPu + QnPnPn 

(6.4) 

+ 9i2923P24P34 + QnPnPuQuPu + QnQiiPiyPu 

If we assume all the links have the same reliability 0.9, then
DPR(P1) = 0.99891 and DSR = 0.9963.
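
As a check on the value of DPR(P1), the five disjoint terms can be evaluated directly. The term structure below is our reading of the expression for DPR(P1), namely $p_{12}$, $q_{12}p_{13}p_{23}$, $q_{12}q_{23}p_{34}$, $q_{12}q_{13}p_{23}q_{24}p_{34}$ and $q_{12}q_{13}p_{23}p_{24}$, with every $p = 0.9$ and every $q = 0.1$.

```latex
\begin{align*}
DPR(P_1) &= 0.9 + (0.1)(0.9)^2 + (0.1)^2(0.9) + (0.1)^3(0.9)^2 + (0.1)^2(0.9)^2 \\
         &= 0.9 + 0.081 + 0.009 + 0.00081 + 0.0081 \\
         &= 0.99891
\end{align*}
```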








The FST-SPR (Series and Parallel Reduction) analysis 

Speeding up the reliability evaluation process is the major concern of this
analytical tool. The basic principle is
to generate the correct FSTs with fewer cutting steps. Four methods are
presented by Chen & Huang (1992), which can be used interchangeably to speed up
the reliability evaluation. These methods are nodes merged, parallel reduction,
series reduction, and degree-2 reduction, as described below.

1) Nodes merged occurs when a probability subgraph has bit value 1 in its 
LS vector, i.e. the corresponding link must be operational in all its 
subgraphs. Hence the two nodes connected by this link can be merged 
into one node together with the link itself. 

2) Parallel reduction occurs when a probability subgraph contains two or
more links between two nodes. With the connectivity property, these
redundant links can be reduced to one link and the operational probability
is replaced by

$$p_{ij} = 1 - \prod_{k} (1 - p_{ijk}) \tag{6.5}$$

where $p_{ijk}$ is the operational probability for the $k$-th link between nodes $i$
and $j$.

3) Series reduction occurs when a probability subgraph has a node with
node degree = 2 (i.e. two links connect to this node) that contains no data
file required for executing the distributed program. Since deleting such a node
does not affect the correct FST generation, we can
remove this node and reduce the two links that connect it to its neighboring
nodes into one link. The new operational probability between the two
neighboring nodes ($i$ and $j$) is replaced by

$$p_{ij} = p_{ik} p_{kj} \tag{6.6}$$

where $k$ is the deleted node.

4) Degree-2 reduction occurs when a probability subgraph has a node with
node degree = 2 that is not a leaf node of any MFST of the current graph.
Since this node is not a leaf node of any MFST, the two adjacent
links of this node must work or fail simultaneously; thus we can
remove this node, reduce the two links that connect it to its neighboring
nodes into one link, and copy the data files and programs in this node to
either of its two neighboring nodes. The new operational probability
between the two neighboring nodes ($i$ and $j$) is replaced by

$$p_{ij} = p_{ik} p_{kj} \tag{6.7}$$

where $k$ is the deleted node.
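
The probability updates of Eqs. (6.5) to (6.7) are simple enough to state as code. The helper names below are hypothetical, and the sketch covers only the arithmetic, not the graph rewriting (merging nodes, moving files) that accompanies each reduction.

```python
from math import prod


def parallel_reduction(link_probs):
    """Eq. (6.5): parallel links fail together only if every one of them fails."""
    return 1.0 - prod(1.0 - p for p in link_probs)


def series_reduction(p_ik, p_kj):
    """Eqs. (6.6) and (6.7): both links through the deleted node k must work."""
    return p_ik * p_kj


print(parallel_reduction([0.9, 0.9]))
print(series_reduction(0.9, 0.9))
```

With two links of reliability 0.9, the parallel reduction gives a link of reliability 1 - 0.1 x 0.1 = 0.99, while the series (or degree-2) reduction gives 0.9 x 0.9 = 0.81.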



Note again that, like the FST analytical tool, the FST-SPR assumes that the
processing elements (i.e., nodes) in the distributed computing system are perfect.
Hence, this analytical tool is also only suitable for distributed computing
systems whose processing elements are perfect or highly reliable, so that the
probability for them to fail is negligible when running the programs.

An example for the reduction methods of the FST-SPR is shown below. 



Example 6.5. Suppose that the subgraph depicted in Fig. 6.7 has been generated.

We need to compute the reliability of program P1, which requires the data files
F1,F2,F3,F4 to complete its execution. The states of all the links are
represented by different types of lines (dashed line: failure; double line:
operational; single line: undetermined) and also by the vectors LS and NC. The
reduction steps that speed up the FST generation are given below.

LS=1**0**

NC=100000



Fig. 6.7. A subgraph during reliability evaluation process. 







Fig. 6.8. Reduction steps for the subgraph of Fig. 6.7. 



Step 1: Since link L12 can no longer be cut and must be up for the rest of its
subgraph generation due to LS=1**0**, we apply the nodes merged
reduction on nodes N1 and N2. The resulting subgraph (b) is shown in Fig.
6.8(b).

Step 2: A parallel reduction can be applied on the resulting subgraph (from
step 1) since links L13 and L23 are parallel. The new resulting subgraph (c)
is shown in Fig. 6.8(c).

Step 3: A series reduction occurs since node N5 contains no data files for the
execution of P1. The new resulting subgraph (d) is shown in Fig. 6.8(d).

Step 4: A degree-2 reduction occurs since node N3 is not a leaf node of any
MFST. The final subgraph (e) after these reductions is also shown in Fig.
6.8(e).



6.3. Homogeneously Distributed Software/Hardware 
Systems 

A typical kind of application on distributed systems has a homogeneously
distributed software/hardware structure. The physical system is assumed to
contain N software subsystems (SW1-SWN) running on N hosts (HW1-HWN), as
depicted in Fig. 6.9.




Fig. 6.9. A general homogeneous distributed software/hardware system. 



That is, identical copies of the distributed application software run on the same type
of hosts; such a system is called a Homogeneous Distributed Software/Hardware System. This
system may be implemented to provide services for uncorrelated random requests
of customers.

In this system, the software is usually improved over time. Since the system involves
combined software and hardware failures as well as a maintenance process, its
reliability cannot be simply estimated by the above analytical tools for computing
the distributed program reliability. The availability models and analyses of the
homogeneous distributed software/hardware system are studied here.



6.3.1. Availability model 

A homogeneous distributed software/hardware system is a type of cluster
system, which is a collection of computers in which any member of the cluster is
capable of supporting the processing functions of any other member; see Mendiratta
(1998) and Lyu & Mendiratta (1999). A cluster has a redundant $n+k$
configuration, where $n$ processing nodes are necessary and $k$ processing nodes are
in the spare state, serving as backup. In this subsection, our model is a cluster of $N$
homogeneous hosts that work in parallel. This means that the system fails only if all of the $N$
hosts have failed; as long as at least one host works, the system is
still working.

The following are the assumptions concerning this system: 

(a) All the hosts have the same hardware failure rate $\lambda_h$ arising from an
exponential distribution.

(b) Each of the hosts runs a copy of the same software with a failure rate
function $\lambda_s(t)$ of a given software model.

(c) Both the software and hardware have only two states, up (working state) 
and down (malfunctioning state), which means all the failures of software 
or hardware are crash failures. 

(d) There are maintenance personnel to repair the system upon software or
hardware malfunction. The repair time has an exponential distribution
with parameter $\mu_s$ for software and parameter $\mu_h$ for hardware,
respectively.

(e) All the failures involved (either software or hardware) are mutually 
independent. 

(f) No two or more failures (either software or hardware) occur at the same 
time. 

There are real cases of homogeneously distributed software/hardware
systems in which all the hosts can work independently on random/unknown
requests. Such applications can be found in search engine systems, telephone
switching systems, banking systems, and so on. Most homogeneous distributed
software/hardware systems that work independently under
uncorrelated random requests can be described by our models.

Systems in practice can be complex and usually we have a multi-host 
situation. Lai et al. (2002) used a Markov process to model this type of system. 
Fig. 6.10 illustrates a partial system state transition of the Markov process, in 
which (i, j) is the state when i hosts suffer hardware failures and j hosts suffer 
software failures. 

The corresponding Chapman-Kolmogorov differential equation for the
probability that the system is in the state $(i,j)$ at time $t$ is, for
$i,j \neq 0,N$; $i+j \leq N-1$,

$$P'_{i,j}(t) = \mu_h P_{i+1,j}(t) + (N-i-j+1)\lambda_h P_{i-1,j}(t) + (N-i-j+1)\lambda_s(t) P_{i,j-1}(t) + \mu_s P_{i,j+1}(t) - x_{i,j} P_{i,j}(t) \tag{6.8}$$

where

$$x_{i,j} = \mu_s + (N-i-j)\lambda_h + (N-i-j)\lambda_s(t) + \mu_h \tag{6.9}$$

The initial conditions are

$$P_{0,0}(0) = 1 \quad \text{and} \quad P_{i,j}(0) = 0 \ \text{ for } (i,j) \neq (0,0) \tag{6.10}$$









State(ij): i hw down, j sw (on different hosts) down 



Fig. 6.10. The partial state transition graph for the N-host system.

The boundary conditions are:

$$
\begin{aligned}
P'_{0,0}(t) &= \mu_h P_{1,0}(t) + \mu_s P_{0,1}(t) - N[\lambda_s(t) + \lambda_h] P_{0,0}(t) \\
P'_{0,j}(t) &= \mu_h P_{1,j}(t) + (N-j+1)\lambda_s(t) P_{0,j-1}(t) + \mu_s P_{0,j+1}(t) \\
&\quad - \{\mu_h + \mu_s + (N-j)[\lambda_s(t) + \lambda_h]\} P_{0,j}(t) \quad \text{for } j = 1,2,\ldots,N-1 \\
P'_{i,0}(t) &= \mu_h P_{i+1,0}(t) + (N-i+1)\lambda_h P_{i-1,0}(t) + \mu_s P_{i,1}(t) \\
&\quad - \{\mu_h + \mu_s + (N-i)[\lambda_s(t) + \lambda_h]\} P_{i,0}(t) \quad \text{for } i = 1,2,\ldots,N-1 \\
P'_{i,j}(t) &= (N-i-j+1)\lambda_h P_{i-1,j}(t) + (N-i-j+1)\lambda_s(t) P_{i,j-1}(t) \\
&\quad - (\mu_h + \mu_s) P_{i,j}(t) \quad \text{for } i+j = N;\ 0 < i,j < N \\
P'_{N,0}(t) &= \lambda_h P_{N-1,0}(t) - \mu_h P_{N,0}(t) \\
P'_{0,N}(t) &= \lambda_s(t) P_{0,N-1}(t) - \mu_s P_{0,N}(t)
\end{aligned}
\tag{6.11}
$$

The system availability for the N-host based system can be calculated by

$$A(t) = \sum_{i+j<N} P_{i,j}(t) \tag{6.12}$$

Here, we assume each copy of software suffers a failure rate of the JM model
(Jelinski & Moranda, 1972), i.e.,

$$\lambda_s(t) = \phi k_t \tag{6.13}$$

To solve the above differential equations, we need to know the expected
number of remaining software faults ($k_t$). However, since $k_t$ changes with
software debugging, it is usually a function of time. We have used the following
scheme for the numerical calculation, as shown by Lai et al. (2002). According to
the JM model, the probability of software having $k$ remaining faults at time $t$ is

$$P(k,t) = \binom{K_0}{k} \exp(-k\phi t) \cdot [1 - \exp(-\phi t)]^{K_0-k}, \quad \text{for } 0 \leq k \leq K_0 \tag{6.14}$$

Based on this equation, the expected number of remaining software faults at time
$t$ can be computed as

$$k_t = \sum_{k=0}^{K_0} k P(k,t)$$
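
If Eq. (6.14) is read as a binomial distribution with $K_0$ trials and success probability $e^{-\phi t}$ (each initial fault is still present at time $t$ with that probability), the sum has a closed form, since the mean of a binomial distribution is the number of trials times the success probability. The second equality below additionally assumes that Eq. (6.13) reads $\lambda_s(t) = \phi k_t$.

```latex
\begin{align*}
k_t &= \sum_{k=0}^{K_0} k \binom{K_0}{k} \left(e^{-\phi t}\right)^{k} \left(1 - e^{-\phi t}\right)^{K_0-k}
     = K_0 e^{-\phi t}, \\
\lambda_s(t) &= \phi k_t = \phi K_0 e^{-\phi t}.
\end{align*}
```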



The system availability can be computed using any available numerical 
algorithm to solve the differential equations. An example using our above 
Markov model to analyze availability of homogeneous distributed 
software/hardware system is numerically illustrated below. 



Example 6.6. We assume that the hardware failure rate is 0.02 and the software
failure rate per fault is 0.006. The repair rate for hardware is 0.1 while that for
software is 0.12. Fig. 6.11 depicts the resulting system availability of a triple-host
system with different numbers of initial faults.

It can be seen from Fig. 6.11 that the system availability reaches its lowest point
at an early stage. This is because a large number of faults are identified when
software system testing begins. System availability starts recovering after the
lowest point and asymptotically approaches a value less than 1 after a
longer period of time. This is because identified faults are fixed and, as a result,
the software failure rate decreases.
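
A numerical solution of this kind of model can be sketched with a fixed-step Runge-Kutta integrator from the standard library alone. The sketch is not the scheme of Lai et al. (2002). It assumes a working host fails at rate $\lambda_h$ (hardware) or $\lambda_s(t)$ (software), one repair of each kind proceeds at a time at rates $\mu_h$ and $\mu_s$, a repair transition exists only when a host of that kind is down (an assumption made here so that the state probabilities always sum to one), and $\lambda_s(t) = \phi K_0 e^{-\phi t}$. The function name and the value $K_0 = 10$ are illustrative; the rates are those of Example 6.6.

```python
import math


def availability_curve(N, lam_h, phi, K0, mu_h, mu_s, t_end, dt=0.05):
    """Integrate the state equations for states (i, j) and return (curve, P)."""
    states = [(i, j) for i in range(N + 1) for j in range(N + 1 - i)]
    index = {s: n for n, s in enumerate(states)}

    def deriv(t, P):
        lam_s = phi * K0 * math.exp(-phi * t)  # assumed JM failure rate
        dP = [0.0] * len(P)
        for (i, j), n in index.items():
            working = N - i - j
            moves = []
            if working > 0:
                moves.append(((i + 1, j), working * lam_h))
                moves.append(((i, j + 1), working * lam_s))
            if i > 0:  # hardware repair only if some hardware is down
                moves.append(((i - 1, j), mu_h))
            if j > 0:  # software repair only if some software is down
                moves.append(((i, j - 1), mu_s))
            for target, rate in moves:
                flow = rate * P[n]
                dP[n] -= flow
                dP[index[target]] += flow
        return dP

    P = [0.0] * len(states)
    P[index[(0, 0)]] = 1.0
    t, curve = 0.0, [(0.0, 1.0)]
    for _ in range(int(round(t_end / dt))):
        k1 = deriv(t, P)
        k2 = deriv(t + dt / 2, [p + dt / 2 * k for p, k in zip(P, k1)])
        k3 = deriv(t + dt / 2, [p + dt / 2 * k for p, k in zip(P, k2)])
        k4 = deriv(t + dt, [p + dt * k for p, k in zip(P, k3)])
        P = [p + dt / 6 * (a + 2 * b + 2 * c + d)
             for p, a, b, c, d in zip(P, k1, k2, k3, k4)]
        t += dt
        # Eq. (6.12): the system is up while fewer than N hosts are down.
        curve.append((t, sum(P[index[s]] for s in states if s[0] + s[1] < N)))
    return curve, P


curve, final_P = availability_curve(3, 0.02, 0.006, 10, 0.1, 0.12, t_end=50.0)
print(min(a for _, a in curve))
```

Plotting `curve` for several values of `K0` gives availability curves that can be compared with those described for Fig. 6.11; the exact values depend on the assumptions stated above.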



6.3.2. Model of the imperfect debugging process 

In the above section, the homogeneous distributed software/hardware system
model assumed that the debugging process was perfect. In reality, it is possible
that a fault that is supposed to have been removed causes a failure
again. This may be due to the spawning of a new fault because of imperfect
debugging, see e.g. Fakhre-Zakeri & Slud (1995), Sridharan & Jayashree (1998),
Pham et al. (1999) and Tokuno & Yamada (2000).



Markov modeling 

The assumptions used in this imperfect debugging model are almost the same as
the assumptions (a)-(f) in the earlier model, except that the following assumption is
added.

(g) When a software failure occurs, repair starts instantaneously with the
following debugging probabilities:

The software fault content is reduced by one with probability $p$.

The software fault content remains unchanged with probability $r$.

The software fault content is increased by one with probability $q$.

This assumption is the same as that of the birth-death process introduced in Kremer
(1983).
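
The effect of assumption (g) on the fault content can be illustrated by propagating its probability distribution over successive debugging actions. This is a sketch of the fault-count chain only, not of the full availability model; it assumes that no further software failure (and hence no debugging) occurs once the fault content reaches zero, and the function name is hypothetical. The values p = 0.831, r = 0.091, q = 0.078 and 42 initial faults are taken from Example 6.7.

```python
def fault_content_distribution(k0, n_repairs, p, r, q):
    """Distribution of the remaining fault count after n_repairs debugging actions."""
    dist = {k0: 1.0}
    for _ in range(n_repairs):
        nxt = {}
        for k, prob in dist.items():
            if k == 0:
                # Assumed absorbing: with no faults left, nothing more to debug.
                nxt[0] = nxt.get(0, 0.0) + prob
                continue
            for change, weight in ((-1, p), (0, r), (1, q)):
                nxt[k + change] = nxt.get(k + change, 0.0) + prob * weight
        dist = nxt
    return dist


dist = fault_content_distribution(42, 10, p=0.831, r=0.091, q=0.078)
mean = sum(k * prob for k, prob in dist.items())
print(round(mean, 2))
```

As long as the fault content stays above zero, each debugging action changes the expected fault content by q - p, so with these values ten actions lower the expected content from 42 by 7.53.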

Fig. 6.12 illustrates a partial system state transition, in which (i, j, k) is the 
state when i hosts suffer hardware failures, j hosts suffer software failures and k is 
the number of remaining software faults at that time. Here N is the total number 
of hosts in the system. 

The corresponding Chapman-Kolmogorov differential equations for the
probability that the system is in the state $(i,j,k)$ at time $t$, $P_{i,j,k}(t)$, can be
obtained. They can be solved numerically or, in some cases, analytically. The
system availability at time $t$ can be calculated as

$$A(t) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-i-1} \sum_{k \geq 0} P_{i,j,k}(t) \tag{6.15}$$

Fig. 6.12. The state transition graph for the N-host system.



Although those differential equations can be solved, the procedure becomes 
difficult when the number of hosts is large. Hence, some computing tools can be 
used to solve them. An example is illustrated below. 



Example 6.7. In this numerical example, the software failures are assumed to
follow the JM model. For multi-host systems with different numbers of hosts,
the system availability functions can be obtained numerically. The curves of the
system availability functions for $N$ = 2, 3, 4, 5 are depicted in Fig. 6.13 with
parameters

$\mu_s = 0.1536$, $\mu_h = 0.1331$, $p = 0.831$, $q = 0.078$,
$r = 0.091$, $K_0 = 42$, $\phi = 0.0013$ and $\lambda_h = 0.005$

Fig. 6.13. The curves of system availability for different numbers of hosts.



Fig. 6.13 shows a similar trend as that of Fig. 6.11. System availability reaches 
the lowest point at an early stage. After that period, system availability starts 
recovering because identified faults are fixed and as a result software failure 
occurrence rate decreases. 



6.4. Centralized Heterogeneous Distributed Systems 

Most distributed service systems can be modeled as a centralized
heterogeneous distributed system. This type of distributed system consists of
heterogeneous subsystems that are managed by a control center, see e.g. Hussain
& Hussain (1992) and Langer (2000, pp. 188-217). The system is different from
the systems and models in the above sections, because those models either
assumed constant operational probabilities without reliability growth or excluded
the network reliability. In contrast, this system incorporates not only the
hardware/software/network reliability but also the improvement of the control
center through the debugging/maintenance process. Dai et al. (2003a) have analyzed
its service reliability, and the results are summarized in the following.



6.4.1. Service of the system and its reliability 

The structure of the Centralized Heterogeneous Distributed System is depicted by 
Fig. 6.14. The control center may consist of many servers. These servers support 
a virtual machine. The virtual machine can manage programs and data from 
heterogeneous subsystems through virtual nodes. The virtual nodes can mask the
differences among various platforms. They are virtual executing
elements that include only the basic units for processing data, i.e. CPU and
memory. The virtual machine and virtual nodes are supported by the
software and hardware in the control center.

The heterogeneous sub-distributed systems are composed of different types of 
computers with various operating systems connected by different topologies of 
networks. These subsystems exchange data with virtual machine through System 
Service Provider Interface (SSPI). They are connected with virtual nodes by 
routers. They can cooperate to achieve a distributed service under the 
management of the virtual machine. 

In fact, most service systems can be categorized as centralized
heterogeneous distributed systems, such as the military system example shown
in Fig. 6.2.

Fig. 6.14. Structure of the centralized heterogeneous distributed
service system.



The whole process for a service in a distributed system is repeated so the 
reliability analysis of a distributed service is crucial for a distributed system. The 
distributed service reliability is defined as below. 



Definition 6.5. Distributed service reliability is the probability for a service to be 
successfully achieved in a distributed computing system. 



6.4.2. Model of distributed service reliability 

In a distributed service system, a service includes various distributed programs
completed by diverse computers. Some later programs might require several
preceding programs to be completed. Every program requires a certain execution
time. The execution of some programs might require certain input files that are
saved or generated in different computers of the distributed system. The overall
distributed service reliability depends on the reliability of the programs, the
availability of the input files to the programs and the system reliability of the
subsystems.

The reliability of a service is determined by the distributed program
reliability in each subsystem and the availability of the control center. For a service
to be achieved successfully, the programs running in every subsystem must be
successful. The virtual machine should be available at the moment when any
program needs an input file prepared in the virtual machine. It also has to be
available during the period when programs are being executed in the virtual
machine.

Through the critical path method, see e.g. Hillier &
Lieberman (1995), we can obtain the time points at which the programs require the files
prepared in the virtual machine ($T_j^f$), $j$ = 1,2,...,$J$. We can also obtain the starting
times at which the programs run in the virtual machine ($T_k^{sp}$) and the corresponding
execution time periods for those programs ($T_k^{ex}$), $k$ = 1,2,...,$K$.

Let $A(t)$ denote the availability of the virtual machine at time $t$. We
also assume that the programs require input files at the beginning time, $T_j^f$, so
the availability of the input files can be calculated as

$$P_f(j) = A(T_j^f), \quad j = 1,2,\ldots,J \tag{6.16}$$

It is assumed that the virtual machine has to be available from the beginning to
the end when a program runs in it; otherwise, the program fails. The average
availability of the programs, which start at time $T_k^{sp}$ with the execution time
period $T_k^{ex}$, can be calculated as

$$P_{pr}(k) = \int_{T_k^{sp}}^{T_k^{sp}+T_k^{ex}} A(t)\,dt \Big/ T_k^{ex}, \quad k = 1,2,\ldots,K \tag{6.17}$$

Let $N$ be the number of subsystems. The distributed system reliability for the
$i$-th subsystem is denoted by $DSR_i$ ($i$ = 1,2,...,$N$), where the virtual machine is
first viewed as a perfect node in each sub-distributed system. The $DSR_i$
($i$ = 1,2,...,$N$) can be computed by the various algorithms presented in the previous
section. Then, the availability of the virtual machine is incorporated into the
distributed service reliability together with the $DSR_i$.

In order to calculate distributed service reliability, some additional 
assumptions on statistical independence are needed: 

1) The $DSR_i$ ($i=1,2,\ldots,N$) are assumed to be mutually independent.

2) The files prepared in the virtual machine are also mutually independent. 

3) The programs running in the virtual machine are mutually independent. 

Although the independence assumptions may not always hold, they serve as a
first-order approximation.

The distributed service reliability function with respect to the initial time,
$t_b$, can be calculated by

$$R_s(t_b) = \prod_{i=1}^{N} DSR_i \cdot \prod_{j=1}^{J} P_f(j) \cdot \prod_{k=1}^{K} P_{pr}(k) \tag{6.18}$$

Eq. (6.18) can be explained as follows. The virtual machine can be viewed as a
perfect node in calculating $DSR_i$ without considering the availability of
prepared files and executed programs in it. Thus, the service reliability is the
whole distributed system reliability $\prod_{i=1}^{N} DSR_i$ multiplied by the
availability of files and programs in the virtual machine.

Furthermore, the availability of files and programs in the virtual machine can
be expressed as the product of $\prod_{j=1}^{J} P_f(j)$ and
$\prod_{k=1}^{K} P_{pr}(k)$. Hence, the overall distributed service reliability
function, which is the product of all three quantities, can be expressed as in
the above equation.



6.4.3. Algorithm for distributed service reliability

To apply the general approach, the system structure must be identified first;
the above model can then be used. The algorithm for calculating the distributed
service reliability consists of the following six steps:

Step 1: Identify the structure of the Centralized Heterogeneous Distributed
System and the relationship between programs and files.

Step 2: Obtain the availability function of the virtual machine with any
existing model.

Step 3: Treat the virtual machine as a perfect node in every subsystem and
calculate $DSR_i$ ($i=1,2,\ldots,N$).

Step 4: Use the critical path method to determine $T_j^f$ ($j=1,2,\ldots,J$)
and $T_k^{bp}$, $T_k^{ex}$ ($k=1,2,\ldots,K$).

Step 5: Calculate $P_f(j)$ and $P_{pr}(k)$.

Step 6: Calculate the distributed service reliability function at time $t_b$.

Note that different models and methods can be used to calculate the distributed
service reliability. For subsystems, the $DSR_i$ can be calculated through
algorithms such as MFST (Kumar et al., 1986), FST (Chen & Huang, 1992), HRFST
(Chen et al., 1997), etc. The availability function of the virtual machine,
$A(t)$, can be calculated through the models presented by Lai et al. (2002).
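Steps 5 and 6 of the algorithm can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the availability function is taken from a simple two-state (up/down) Markov model with hypothetical rates `lam` and `mu`, which stands in for the models of Lai et al. (2002), and the $DSR_i$ values and time points are made-up inputs rather than results from the text. The integral in Eq. (6.17) is approximated with the trapezoidal rule.

```python
import math

def two_state_availability(lam, mu):
    """Return A(t) for an assumed up/down Markov model that starts in the up state."""
    s = lam + mu
    return lambda t: mu / s + (lam / s) * math.exp(-s * t)

def program_availability(A, t_bp, t_ex, steps=1000):
    """Eq. (6.17): average of A(t) over [t_bp, t_bp + t_ex] (trapezoidal rule)."""
    h = t_ex / steps
    area = 0.5 * (A(t_bp) + A(t_bp + t_ex))
    area += sum(A(t_bp + i * h) for i in range(1, steps))
    return area * h / t_ex

def service_reliability(A, dsr, file_times, program_windows):
    """Eq. (6.18): product of all DSR_i, P_f(j) and P_pr(k)."""
    r = math.prod(dsr)                       # subsystem reliabilities DSR_i
    r *= math.prod(A(t) for t in file_times) # Eq. (6.16): P_f(j) = A(T_j^f)
    r *= math.prod(program_availability(A, b, e) for b, e in program_windows)
    return r

# Hypothetical inputs: two subsystems, two files, two programs.
A = two_state_availability(lam=0.001, mu=0.05)
print(service_reliability(A,
                          dsr=[0.98, 0.95],
                          file_times=[0.0, 20.0],
                          program_windows=[(5.0, 30.0), (40.0, 60.0)]))
```

Passing the availability function as an argument keeps the sketch independent of the particular availability model chosen in Step 2.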



6.5. Notes and References 

In distributed computing systems, the group of MFST algorithms has been further
developed. Kumar (1988) proposed a “Fast Algorithm for Reliability Evaluation”
that used a connection matrix to represent each MFST and proposed some
simplified techniques for speeding up the analysis process. Then, Kumar &
Agrawal (1996) further introduced the “Distributed Program/System Performance
Index”, which can be used to compare networks with different features for
application execution.

For the group of FST analytical tools, Lin et al. (1999a) further presented an
efficient algorithm for reliability analysis of distributed computing systems.
The efficient algorithm was studied specifically for different kinds of network
topologies such as star topologies (Chang et al., 2000 and Lin, 2003) and
ring-type topologies (Lin et al., 2001). This group of analytical tools can be
further extended to allow the failures of imperfect nodes, see e.g. Ke & Wang
(1997) and Lin et al. (1999b).

Other than the above two groups of analytical tools, Lopez-Benitez (1994) 
also presented a modeling approach based on stochastic Petri nets to estimate the 
reliability and availability of programs in a distributed computing system. Later, 
Chen et al. (1998) presented a Markov model to study the distributed system 
reliability with the information on time constraints. Malluhi & Johnston (1998) 
developed a distributed parallel storage system to achieve scalability and high 
data throughput. Fricks et al. (1999) proposed an analytic approach, based on the 
Markov regenerative processes and the Petri nets, to compute the response-time 
distribution of operator consoles in a distributed process control environment. 
Das & Woodside (2001) evaluated the layered distributed software systems with 
fault-tolerant features. Yeh & Chiu (2001) proposed a reversing traversal method 
to derive a k-node distributed system under capacity constraint. Chiu et al. (2002) 
recently developed a reliability-oriented task allocation scheme for the distributed 
computing systems. Mahmood (2001) discussed the task allocation algorithms for 
maximizing reliability of heterogeneous distributed computing systems. 

Fahmy (2001) considered reliability evaluation in distributed computing 
environments by using the concept of Analytical Hierarchy Process (AHP). 
Lanus et al. (2003) presented hierarchical composition and aggregation models 
based on Markov reward models to study the state-based availability and 
performability of distributed systems. Yeh (2003) extended the distributed system 
reliability by introducing a multi-state concept.



RELIABILITY OF GRID
COMPUTING SYSTEMS



Grid computing is a recently developed technique for complex systems with
large-scale resource sharing, wide-area communication, and multi-institutional
collaboration. Many experts believe that the grid technologies will offer a
second chance to fulfill the promises of the Internet (Foster et al., 2002).
However, it is difficult to analyze the grid reliability due to its highly
heterogeneous and wide-area distributed characteristics.

This chapter first presents a brief discussion of the grid computing system.
A general grid reliability model is then constructed. We also present
approaches to compute the grid reliability by incorporating various aspects of
the grid structure including the resource management system, the network and
the integrated software/resources.



7.1. Introduction of the Grid Computing System

7.1.1. Grid technology 

The term “Grid” was created in the mid 1990s to denote a proposed distributed 
computing infrastructure for advanced science and engineering (Foster & 
Kesselman, 1998). Grid concepts and technologies were first developed to enable 
resource sharing within far-flung scientific collaborations. Applications include 
collaborative visualization of large scientific datasets (pooling of expertise), 
distributed computing for computationally demanding data analyses (pooling of 
compute power and storage), and coupling of scientific instruments with remote 
computers and archives (increasing functionality as well as availability). 

The real and specific problem that underlies the Grid concept is coordinated 
resource sharing and problem solving in dynamic, multi-institutional virtual 
organizations (Foster et al., 2001). The sharing that we are concerned with is not 
primarily file exchange but rather direct access to computers, software, data, and 
other resources. This is required by a range of collaborative problem-solving and 
resource-brokering strategies emerging in industry, science, and engineering. 
This sharing is highly controlled, with resource providers and consumers defining
what is shared, who is allowed to share, and the conditions under which the
sharing occurs. A set of individuals or institutions defined by such sharing
rules forms what is usually called a virtual organization (VO).

For example, in a data grid project thousands of physicists at hundreds of 
laboratories could be involved. They can be divided into different virtual 
organizations according to their locations or functions, as depicted in Fig. 7.1.

In this case, virtual organizations can vary tremendously in their purpose,
scope, size, duration, structure, community, and sociology. A careful study of
the underlying technology requirements, however, leads us to identify a broad
set of common concerns and requirements, which current distributed computing
technologies do not address.

Fig. 7.1. A grid computing system containing many virtual organizations.



Over the past several years, research and development efforts within the grid 
community have produced protocols, services, and tools that address precisely 
the challenges that arise when we seek to build scalable virtual organizations, 
e.g. Foster & Kesselman (1998), Foster et al. (2001, 2002), Frey et al. (2002) 
and Buyya et al. (2003). 

Because of their focus on dynamic, cross-organizational sharing, Grid 
technologies complement rather than compete with the existing distributed 
computing technologies. For example, enterprise distributed computing 
systems can use the grid technologies to achieve resource sharing across 
institutional boundaries. The grid technologies can also be used to establish 
dynamic markets for computing and storage resources. 

The continuing decentralization and distribution of software, hardware, and 
human resources make it essential that we achieve the desired quality of service 
(QoS) on resources assembled dynamically from enterprise, service provider, 
and customer systems. This also requires new abstractions and concepts that let
applications access and share resources across wide-area networks. Common
security semantics, system reliability, distributed resource management
performance, or other QoS metrics need to be provided.

Although the development tools and techniques for the grid have been
studied, grid reliability analysis is not easy due to the complexity of the grid.
As one of the important measures of QoS for the grid, the grid reliability needs
to be precisely and effectively assessed using new analytical tools. This chapter
presents some new results based on general grid reliability models that relax
some traditional assumptions which were made for small-scale distributed
computing systems and are unsuitable for the grid.



7.1.2. General architecture of grid computing system 

The general architecture of the grid computing systems is depicted in Fig. 7.2.
The virtual node is a general unit in the grid, which can execute programs
or share resources. Virtual nodes are connected with each other through the
virtual links. Virtual organizations are made up of a number of virtual nodes.







VN: Virtual Node 
VL: Virtual Link 
VO: Virtual Organization 



RMS: Resource Management System 



Fig. 7.2. General architecture of grid computing systems.

A grid system is designed to execute a set of programs/applications in order
to complete certain tasks. Executing those programs requires the use of some
resources in the grid. These programs and resources are distributed on the
virtual nodes as in Fig. 7.2. A virtual link between two virtual nodes (i and j),
denoted by L(i,j), is defined as a direct communication channel between the
two nodes i and j without passing through other virtual nodes.

Let $U_n$ represent the set of resources shared by the $n$:th virtual node and
$V_n$ represent the set of programs executed by the $n$:th virtual node
($n=1,2,\ldots,N$). We also assume that $M$ programs, denoted by
$P_1, P_2, \ldots, P_M$, are running in the grid system. The required processing
time for each program is denoted by $t(1), t(2), \ldots, t(M)$, respectively. The
programs may use some necessary resources during their execution, which in
effect means exchanging information with them through the network. These
resources are denoted by $R_1, R_2, \ldots, R_H$ and are registered in a resource
management system of the grid.

When a program requests certain remote resources, the resource
management system receives these requests and matches the registered
resources to the requests. It then informs the program of the sites of those
matched resources. Once the programs know the sites of their required
resources, they begin to access them through the network.

In the early stage, the grid reliability is mainly determined by the reliability
of the resource management system, while in the later stage, the grid
reliability is mostly affected by the reliability of the network for
communicating or processing. The grid reliability models related to these two
stages are studied in the following two sections, respectively. Then, Section
7.4 further integrates other components such as software and resources into
the grid reliability analysis.



7.2. Grid Reliability of the Resource Management System

Before the programs begin to access their required resources in the grid, they
have to know the sites of those resources, which are managed by the resource
management system. The resource management system of the grid, see e.g.
Livny & Raman (1998), receives the resource requests from application
programs and then matches the requests with the registered resources.



7.2.1. Introduction of resource management system 

For grid computing, the resource management system that manages its pool of
shared resources is very important. This is especially the case for the Open
Grid Services Architecture, see e.g. Foster et al. (2002), which allows
individual virtual organizations to aggregate their own resources on the grid.

The resource management system provides resource management services,
which can be divided into four general layers as depicted in Fig. 7.3. They are
the program layer (A), request layer (B), management layer (C) and resource
layer (D).

A. Program layer: The program layer represents the programs (or tasks) of
the customer’s applications. The programs describe their required
resources and constraint requirements (such as deadline, budget,
function etc).

B. Request layer: The request layer represents the program’s requirement
for the resources. This layer provides the abstraction of “program
requirements” as a queue of resource requests.

C. Management layer: The management layer may be thought of as the
global resource allocation layer and its principal function is to match
the resource requests and resource offers so that the constraints of both
are satisfied.

D. Resource layer: The resource layer represents the registered resources
from different sites including the requirements and conditions.

Fig. 7.3. Layers of resource management system.



In grid computing, failures may occur at any of the layers in the resource 
management system. For example, 

1) In the program layer, the resource described by the program may be
unclear or translated into wrong resource requests.

2) In the request layer, the request queue may be too long for the program
to wait (generating so-called time-out failures), or some requests
may be lost due to certain management faults.

3) In the management layer, the request may be matched to a wrong
resource because of misunderstanding or faulty matching.

4) In the resource layer, a virtual organization may register wrong
information about its resources or remove its registered resources without
notifying/updating the resource management system.

If a grid program experiences any of the above resource management system
failures, the program cannot be completed successfully. The grid reliability
should be computed by considering not only the reliability of physical networks
or processing elements but also the resource management system reliability. In
order to analyze the resource management system reliability, we construct a
Markov model below.



7.2.2. Markov modeling 

If a failure occurs in which a program is matched to a wrong resource, the
program sends failure feedback to the resource management system, which then
removes the faults that caused the failure through an updating/debugging
process. New faults may also be generated in the resource management system,
for example when some virtual organizations register wrong resources with it.
The assumptions for our resource management system reliability model are
listed as follows:

1) The failures of the resource management system follow an exponential
distribution with parameter $\lambda(k)$, where $k$ is the number of
contained faults.

2) If any failure occurs, a fault that causes this failure is assumed to be
removed immediately by an updating/debugging process, i.e. the time
for removing the detected fault is not counted.

3) The resource management system may generate a new fault, and the
occurrence of such an event follows an exponential distribution with a
constant rate v.

According to the above assumptions, the reliability model of the resource
management system can be constructed as a continuous time Markov chain
(CTMC). This Markov model, depicted in Fig. 7.4, is a typical birth-death
Markov process with an infinite number of states, where state $k$ represents
$k$ faults contained in the resource management system.




Fig. 7.4. CTMC for resource management system reliability model. 



In this model, the failure rate $\lambda$ can be a function of the number of
remaining faults $k$. Usually, $\lambda(k)$ is an increasing function of the
number of remaining faults $k$. It is desired for a resource management system
to be in service for a long time, especially for the Open Grid Services
Architecture (Foster et al., 2002), so the birth-death process of failures can
be viewed as a long-run Markov process (Trivedi, 1982). After running for a
long time, the expected death rate $\lambda(k)$ will approach a steady value.
The failure rate $\lambda(k)$ can then be approximately viewed as a constant
over a sufficiently short time interval. An example is illustrated below.



Example 7.1. Consider a grid program, denoted by $P_1$, that needs access to
remote resources. The time for the resource management system to deal with its
request is assumed to be $t=15$ seconds and the failure rate of the resource
management system in that time slot is $\lambda = 0.0005$ per second. The
reliability for the resource management system to deal with the request is
computed as

$$R_{RMS}(P_1) = \exp(-\lambda \cdot t) = 0.992528$$

Based on the long-run birth-death Markov process, this approximation of a
constant failure rate indicates a way to reasonably and dynamically update the
failure rate at different time slots. The resource management system can count
the number of failures, say $n$, reported by the grid programs within a
relatively small time interval, say $\Delta t$, and dynamically update the
value of the failure rate by $\lambda = n / \Delta t$.
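The calculation in Example 7.1 and the dynamic update of the failure rate can be checked with a short Python sketch. The function names are illustrative, and the failure count and observation window in the last lines are hypothetical values, not data from the text.

```python
import math

def rms_reliability(failure_rate, service_time):
    """Reliability of the resource management system over one request,
    assuming a constant failure rate during the (short) service time."""
    return math.exp(-failure_rate * service_time)

def updated_failure_rate(num_failures, interval):
    """Re-estimate the failure rate as n / (delta t) from reported failures."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return num_failures / interval

# Example 7.1: failure rate 0.0005 per second, t = 15 seconds.
print(round(rms_reliability(0.0005, 15), 6))   # 0.992528

# Hypothetical update: 3 failures reported during a 3600-second window.
new_rate = updated_failure_rate(3, 3600)
print(rms_reliability(new_rate, 15))
```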

Also, the fault birth rate v can be reduced through some information
controls such as standardized resource registering, synchronized resource
updating, consistent resource descriptions etc, so as to improve the reliability
of the resource management system.



7.3. Grid Reliability of the Network 

Once the resource management system has informed the programs of the sites of
their required resources in the grid after matchmaking, the running programs
are able to access those resources through the grid network as depicted in
Fig. 7.2. Then, the grid program/system/service reliability is mainly
determined by the reliability of the network, which will be studied in the
following subsections.



7.3.1. Reliability model for the grid network 

To analyze the grid reliability, two assumptions about the model are given 
below: 

1) The failures of virtual nodes and virtual links can be modeled by 
Poisson processes. 

2) The failures of different elements (nodes and links) are independent
of each other.

The first assumption can be justified because, in the operational phase
without a debugging process, the failure rates can remain constant, see e.g.
Yang & Xie (2000). The second assumption can be explained as follows: since the
grid is a wide-area distributed system, the nodes and links are located far
away from each other, so that the correlation among them can be viewed as very
slight or even negligible.

Different programs can exchange information of different sizes with the
same resources. Denote by $D_{mh}$ the size of information exchanged between
program $P_m$ ($m=1,2,\ldots,M$) and resource $R_h$ ($h=1,2,\ldots,H$). The
communication time $T_c(i,j)$ between node $i$ and node $j$ can be derived from

$$T_c(i,j) = \frac{D(i,j)}{S(i,j)} \tag{7.1}$$

where $D(i,j)$ is the total size of information exchanged through the link
$L(i,j)$, and $S(i,j)$ is the expected bit rate of the link.

Denote the failure rate of the node $n$ by $\lambda_n$ and of the link $L(i,j)$
by $\lambda_{i,j}$. If any failure occurs either on the link or on the two
connected nodes during the communication, the communication process is viewed
as a failed process. The reliability of communication between node $i$ and
node $j$ through the link $L(i,j)$ can be expressed as

$$R_c(i,j) = \exp\{-(\lambda_i + \lambda_j + \lambda_{i,j})\,T_c(i,j)\} \tag{7.2}$$

Similarly, during the execution of a program, any failure occurring on the
virtual node that executes the program will also make the program fail. The
reliability of the node $n$ to run the program $P_m$ is then given by

$$R_p(m,n) = \exp\{-\lambda_n\, t(m)\} \tag{7.3}$$
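Eqs. (7.1) to (7.3) translate directly into code. The following Python sketch is a minimal illustration; the function names and all failure rates are assumptions chosen for the example, while the 300 Kbit transfer over a 20 Kbit/s link is in the range of values used in the examples of this chapter.

```python
import math

def comm_time(data_size, bit_rate):
    """Eq. (7.1): communication time = exchanged data size / link bit rate."""
    return data_size / bit_rate

def comm_reliability(lam_i, lam_j, lam_link, t_c):
    """Eq. (7.2): both end nodes and the link must survive for t_c."""
    return math.exp(-(lam_i + lam_j + lam_link) * t_c)

def exec_reliability(lam_n, t_m):
    """Eq. (7.3): the executing node must survive the processing time t(m)."""
    return math.exp(-lam_n * t_m)

# 300 Kbit over a 20 Kbit/s link takes 15 seconds.
t_c = comm_time(300, 20)
# Hypothetical failure rates (per second) for the two nodes and the link.
print(t_c, comm_reliability(0.001, 0.002, 0.003, t_c))
# Hypothetical processing time of 30 seconds on a node with rate 0.001.
print(exec_reliability(0.001, 30))
```

Because the exponent grows with the communication or processing time, longer transfers and longer programs yield lower reliabilities.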



This network reliability model is much more reasonable for the grid than
that of conventional distributed systems shown in Chapter 6. Those
conventional models largely inherit the assumptions of the model of Kumar
et al. (1986). The most stringent assumption, which is not suitable for the
grid, is that the operational probabilities of nodes or links are constant,
i.e. $R_c(i,j)$ and $R_p(m,n)$ in the above two equations are constant no
matter how long or how different $T_c(i,j)$ and $t(m)$ are.

Some concepts of grid reliability are defined as follows. 



Definition 7.1. Grid program reliability (GPR) is defined as the probability of 
successful execution of a given program running on multiple virtual nodes and 
exchanging information through virtual links with the remote resources, under 
the environment of grid computing system. 



Then, the grid system reliability (GSR) can be defined as the probability for all 
of the programs involved in the considered grid system to be executed 
successfully. 

Furthermore, a grid service is to complete certain programs by using some
resources distributed in the grid. The grid service reliability is similar to the
grid system reliability but considers only the programs of the given service,
i.e. programs that are not used by the service are not taken into account.
Thereby, the grid service reliability is defined as the probability that all the
programs of a given service are completed successfully.



7.3.2. Reliability of minimal resource spanning tree 

Recall that the set of virtual nodes and virtual links involved in running the
given programs and exchanging information with the resources form a resource
spanning tree. The smallest dominating resource spanning tree (RST) is called
MRST (Minimal Resource Spanning Tree). The reliability of an MRST is the
probability for the MRST to be operational to execute the given program. The
reliability of an MRST, denoted by $R_{MRST}$, has three parts:

1) Reliability of all the links contained in the MRST during the 
communication. 

2) Reliability of all the nodes contained in the MRST during the 
communication. 

3) Reliability of the root node that executes the program during the 
processing time of the program. 

The reliability of the link $L(i,j)$ for exchanging the information can be
expressed by

$$R_l(i,j) = \exp\{-\lambda_{i,j}\,T_c(i,j)\} \tag{7.4}$$

The total communication time of the node $G_j$ can be calculated by

$$T(j) = \sum_{i \in D_j} T_c(i,j) \tag{7.5}$$

where $D_j$ represents the set of nodes that communicate with the node $G_j$ in
the MRST. The reliability function of the node $G_j$ for communication is

$$R_n(j) = \exp\{-\lambda_j\,T(j)\} \tag{7.6}$$

Finally, the reliability for a program $P_m$ to be executed successfully during
the processing time $t(m)$ on the node $n$ is $R_p(m,n)$.

The reliability of the MRST can be derived from the above equations as

$$R_{MRST} = R_p(m,n) \prod_{L(i,j) \in MRST} R_l(i,j) \prod_{G_j \in MRST} R_n(j)$$

$$= \exp\{-\lambda_n\,t(m)\} \prod_{L(i,j) \in MRST} \exp\{-\lambda_{i,j}\,T_c(i,j)\} \prod_{G_j \in MRST} \exp\{-\lambda_j\,T(j)\}$$

$$= \exp\{-\lambda_n\,[t(m)+T(n)]\} \prod_{L(i,j) \in MRST} \exp\{-\lambda_{i,j}\,T_c(i,j)\} \prod_{G_j \in MRST,\ j \neq n} \exp\{-\lambda_j\,T(j)\} \tag{7.7}$$

In order to simplify the expression, we generalize the term
“communication time” for the root node so that it contains not only the time of
exchanging information with other elements but also the time of executing the
given program, i.e. $t(m)+T(n)$.

The term “element” is defined here to represent both the nodes and links
of the MRST. Assume there are in total $K$ elements in an MRST, so that
$element_i$ ($i=1,2,\ldots,K$) denotes the $i$:th element in the MRST.
Accordingly, the communication time of the $i$:th element is denoted by
$T_w(element_i)$ and $\lambda(element_i)$ represents its failure rate. The
reliability of the MRST in the above equation can be simply expressed as

$$R_{MRST} = \prod_{i=1}^{K} \exp\{-T_w(element_i) \cdot \lambda(element_i)\} \tag{7.8}$$

With this equation, the reliability of an MRST can be computed if the
communication time and failure rate of all the elements are given. Hence,
finding all the MRSTs and determining the communication times of their
elements is the first step in deriving the grid program reliability and grid
system reliability.
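Eq. (7.8) can be evaluated with a few lines of Python once each element is described by its communication time and failure rate. In the sketch below the element names, times and failure rates are all hypothetical; for the root node the time is assumed to already include the processing time, as in the generalized definition above.

```python
import math

def mrst_reliability(elements):
    """Eq. (7.8): product over elements of exp(-T_w * lambda).

    `elements` maps an element name to (communication_time, failure_rate).
    """
    exponent = sum(t_w * lam for t_w, lam in elements.values())
    return math.exp(-exponent)

# Hypothetical MRST with three nodes and two links (times in seconds,
# failure rates per second). G1 is the root, so its time includes t(m).
mrst = {
    "G1":     (45.0, 0.0002),
    "G2":     (7.5,  0.0003),
    "G3":     (22.5, 0.0001),
    "L(1,3)": (15.0, 0.0010),
    "L(2,3)": (7.5,  0.0020),
}
print(mrst_reliability(mrst))
```

Summing the exponents before taking a single exponential is equivalent to multiplying the individual terms of Eq. (7.8).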

The same program executed by different root nodes may cause different
communication times on the same elements. Hence, the MRSTs should be
treated distinctly for the same program executed by different nodes. An
example is given below.

Example 7.2. As shown in Fig. 7.5, program $P_1$ can run successfully when
either computing node G1 or G4 works successfully during the processing
time and is able to exchange information successfully with the required
resources (say R1, R2 and R3).




Fig. 7.5. A four-node computing system. 



The MRSTs considering the communication time of the elements should be
separated into two parts:

(a) $P_1$ being executed by G1 contains three MRSTs: 1) {G1,G2,L(1,2)};
2) {G1,G2,G3,L(1,3),L(2,3)}; 3) {G1,G3,G4,L(1,3),L(3,4)}.

(b) $P_1$ being executed by G4 contains another three MRSTs: 4) {G3,G4,L(3,4)};
5) {G2,G3,G4,L(2,4),L(2,3)}; 6) {G1,G2,G4,L(1,2),L(2,4)}.



An algorithm is presented in Dai et al. (2002) to search the MRSTs for a given
program executed by one given virtual node. By repeatedly using this algorithm,
all the MRSTs for different virtual nodes to execute this program can be found,
respectively. This algorithm can be briefly described as follows:

Step 1. Start from the given node to search the required resources along the
possible links, and record the elements that compose the searching route
and their communication times.

Step 2. Once all the required resources are reached, an MRST has been found;
record this MRST.

Step 3. Then try other routes to search for other MRSTs until all the
MRSTs have been found.
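To make the notion of an MRST concrete, the following Python sketch enumerates MRSTs by brute force over subsets of links. This is not the search algorithm of Dai et al. (2002); it is a simple substitute that is only practical for very small networks. A link subset is accepted when it forms a tree containing the root, the nodes of the tree hold all required resources, and no leaf other than the root can be removed without losing a resource. The link list follows the four-node system of Example 7.2, but the resource placement (R1 on G1 and G3; R2 and R3 on G2 and G4) is an assumption that is not stated in the text; it was chosen because it is consistent with the six MRSTs listed in Example 7.2.

```python
from itertools import combinations

def _connected(nodes, links, root):
    """Check that every node is reachable from the root using only `links`."""
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for a, b in links:
            if u in (a, b):
                v = b if u == a else a
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return seen == nodes

def _covers(nodes, holders, required):
    """Check that the resources held by `nodes` include all required ones."""
    have = set()
    for n in nodes:
        have |= holders.get(n, set())
    return required <= have

def _has_redundant_leaf(nodes, links, root, holders, required):
    """A tree is not minimal if some non-root leaf can be dropped."""
    for n in nodes - {root}:
        degree = sum(n in link for link in links)
        if degree == 1 and _covers(nodes - {n}, holders, required):
            return True
    return False

def find_mrsts(links, root, holders, required):
    """Enumerate minimal resource spanning trees rooted at `root`."""
    links = [tuple(sorted(link)) for link in links]
    found = []
    for r in range(len(links) + 1):
        for subset in combinations(links, r):
            nodes = {root} | {n for link in subset for n in link}
            if len(subset) != len(nodes) - 1:
                continue                      # a tree has |nodes| - 1 links
            if not _connected(nodes, subset, root):
                continue
            if not _covers(nodes, holders, required):
                continue
            if _has_redundant_leaf(nodes, subset, root, holders, required):
                continue
            found.append((frozenset(nodes), frozenset(subset)))
    return found

links = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
# Assumed resource placement (hypothetical, see the lead-in).
holders = {1: {"R1"}, 3: {"R1"}, 2: {"R2", "R3"}, 4: {"R2", "R3"}}
required = {"R1", "R2", "R3"}

for nodes, tree in find_mrsts(links, 1, holders, required):
    print(sorted(nodes), sorted(tree))
```

Calling `find_mrsts` again with root 4 gives the MRSTs for the program executed by G4. The communication times of the elements would still have to be attached to each tree, as described in Step 1.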



An example of the algorithm to search the MRSTs is illustrated below. 



Example 7.3. This example continues Example 7.2. Referring to Fig. 7.5
again, the program $P_1$ is assumed to exchange information with resources
R1, R2, R3 (the corresponding exchanged information sizes are 500, 400, 300
Kbit). The bit rates of links L(1,2), L(1,3), L(2,3), L(2,4), L(3,4) are assumed
to be 30, 20, 40, 50, 45 (Kbit/s). Then, we search the MRSTs for $P_1$ executed
by the node G1 and compute the communication time of each element in those
MRSTs, as shown in Fig. 7.6.

The algorithm finds three MRSTs, marked by © in Fig. 7.6, at which all the
values in the vector RV are 0. The corresponding elements contained
in those MRSTs are recorded in the vector EV with the value 1 and the
corresponding communication times are saved in the vector WV.

Similarly, the other three MRSTs for $P_1$ executed by the other node G4 can
also be obtained as listed in the above Example 7.2.




Fig. 7.6. Searching the MRSTs of PI executed by Gl. 



7.3.3. Grid program and system reliability

Grid program reliability

Note that the given program fails only if all of its MRSTs fail, and an MRST
can complete the program successfully only if all of its elements are reliable.
The grid program reliability of a given program can be described as the
probability of having at least one of the MRSTs working successfully,

R(P_m) = Pr(at least one MRST of a given program P_m is reliable)

Let $N_t(P_m)$ be the total number of MRSTs for the given program $P_m$ and
$E_j$ be the event in which $MRST_j$, $j=1,2,\ldots,N_t(P_m)$, is able to
successfully execute the given program. The grid program reliability of a given
program $P_m$ can be written as

$$R(P_m) = \Pr\left(\bigcup_{j=1}^{N_t(P_m)} E_j\right) \tag{7.9}$$

By using the concept of conditional probability, the events considered in this
equation can be decomposed into mutually exclusive events as

$$R(P_m) = \Pr(E_1) + \Pr(E_2)\Pr(\bar{E}_1 \mid E_2) + \cdots + \Pr(E_{N_t(P_m)})\Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{N_t(P_m)-1} \mid E_{N_t(P_m)}) \tag{7.10}$$

where $\Pr(\bar{E}_1 \mid E_2)$ denotes the conditional probability that
$MRST_1$ is in the failure state given that $MRST_2$ is in the successful state.

Hence, the grid program reliability can be evaluated in terms of the
probability of two distinct events. The first event indicates that $MRST_i$ is
in the operational state, while the second indicates that all of its previous
trees $MRST_j$ ($j=1,2,\ldots,i-1$) are in the failure state given that $MRST_i$
is in the operational state. The probability of the first event, $\Pr(E_i)$, is
straightforward, and it can be calculated through Eq. (7.8). The probability of
the second event, $\Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{i-1} \mid E_i)$,
can be computed using the algorithms presented by Dai et al. (2002).

A brief introduction to the algorithm is given here. It contains two steps.

Step 1 identifies all the conditional elements that can lead to the failure of
any $MRST_j$ ($j=1,2,\ldots,i-1$) while keeping $MRST_i$ operational.
Such a conditional element, say element $k$ (contained in any $MRST_j$,
$j=1,2,\ldots,i-1$), has a starting time and an end time. If any failure
occurs on the element $k$ between its starting time and end time, it can
lead the $MRST_j$ to fail.

Step 2 uses a binary search tree (Johnsonbaugh, 2001: pp. 349-354) to seek
the possible combinations of these identified elements that can make all
the $MRST_j$ ($j=1,2,\ldots,i-1$) fail and computes the probabilities of
those combinations.

The summation of the probabilities is the result of
$\Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{i-1} \mid E_i)$. Detailed
procedures for the two steps, together with an example of this algorithm, can
be found in Dai et al. (2002).
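For a small number of MRSTs, the union probability in Eq. (7.9) can also be evaluated by inclusion-exclusion instead of the conditional decomposition described above. The Python sketch below uses this swapped-in technique and does not reproduce the algorithm of Dai et al. (2002). It rests on a simplifying assumption that is not made in the text: every element is needed from time 0 up to its communication time, so that an element shared by several MRSTs must survive the longest of its required times for all of them to be operational. All element names, times and failure rates are hypothetical.

```python
import math
from itertools import combinations

def joint_success(trees, rates):
    """Probability that every MRST in `trees` is operational.

    Assumption: each element is used from time 0, so a shared element
    must survive the maximum of its required times.
    """
    need = {}
    for tree in trees:
        for elem, t in tree.items():
            need[elem] = max(need.get(elem, 0.0), t)
    return math.exp(-sum(rates[e] * t for e, t in need.items()))

def program_reliability(mrsts, rates):
    """Eq. (7.9) by inclusion-exclusion over all non-empty subsets of MRSTs."""
    total = 0.0
    for r in range(1, len(mrsts) + 1):
        sign = 1.0 if r % 2 == 1 else -1.0
        for subset in combinations(mrsts, r):
            total += sign * joint_success(subset, rates)
    return total

# Hypothetical failure rates (per second) and two MRSTs sharing node G1.
rates = {"G1": 0.0002, "G2": 0.0003, "G3": 0.0001,
         "L(1,2)": 0.001, "L(1,3)": 0.002}
mrst_a = {"G1": 40.0, "G2": 10.0, "L(1,2)": 10.0}
mrst_b = {"G1": 45.0, "G3": 15.0, "L(1,3)": 15.0}
print(program_reliability([mrst_a, mrst_b], rates))
```

The number of terms grows exponentially with the number of MRSTs, which is one reason a decomposition such as Eq. (7.10) is preferable for larger problems.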



Grid system reliability 

The grid system reliability can be written as the probability of the
intersection of the sets of MRSTs of the programs, which is

$$R_s = \Pr\left(\bigcap_{m=1}^{M} MRST(P_m)\right) \tag{7.11}$$

where $MRST(P_m)$ denotes the set of all the MRSTs associated with the
program $P_m$.

The intersection of the trees of each $MRST(P_m)$ can be evaluated first by
intersecting $MRST(P_i)$. The intersected tree of two MRSTs is generated by
putting all the elements of the two MRSTs together, where the communication
times of overlapped elements should be added together. An example of an
intersected MRST is illustrated below.



Example 7.4. Suppose one MRST related to program P1 is
{G1,G2,G3,L(1,3),L(2,3)} with the communication time {45, 7.5, 22.5, 15,
7.5} and one MRST related to program P2 is {G1,G2,G3,L(1,2),L(1,3)} with
the communication time {50, 70, 30, 20, 30}. Then, the intersected MRST of
the above two MRSTs should be {G1,G2,G3,L(1,2),L(1,3),L(2,3)} with the
communication time {95, 77.5, 52.5, 20, 45, 7.5}.
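
The merging rule used in Example 7.4 can be sketched in a few lines of Python. This is only an illustration: representing an MRST as a dictionary from element name to communication time, and the function name `intersect_mrsts`, are assumptions made here and are not taken from the original algorithm.

```python
def intersect_mrsts(tree_a, tree_b):
    """Merge two MRSTs given as {element: communication_time} dictionaries.

    Elements that appear in both trees keep a single entry whose
    communication time is the sum of the two individual times.
    """
    merged = dict(tree_a)
    for element, time in tree_b.items():
        merged[element] = merged.get(element, 0.0) + time
    return merged


# Data of Example 7.4.
mrst_p1 = {"G1": 45, "G2": 7.5, "G3": 22.5, "L(1,3)": 15, "L(2,3)": 7.5}
mrst_p2 = {"G1": 50, "G2": 70, "G3": 30, "L(1,2)": 20, "L(1,3)": 30}

merged = intersect_mrsts(mrst_p1, mrst_p2)
for element in sorted(merged):
    print(element, merged[element])
```

Applying the function to every pair formed from one tree of each program would yield the full list of intersected MRSTs.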



In fact, if any one of the intersected MRSTs of MRST($P_m$) ($m = 1,2,\ldots,M$) is
reliable, all the programs required in the grid system can be successfully
completed; if all the intersected MRSTs fail, the programs of the grid system
cannot all be successfully completed.

After generating all the intersected MRSTs, the grid system reliability can
be written as

$R_s = \Pr\left( \bigcup_{i=1}^{N_t} E_i \right)$    (7.12)

where $N_t$ is the total number of intersected MRSTs. This equation is similar to
the previous Eq. (7.9), so the above algorithms for deriving the grid program
reliability can be similarly used in deriving the grid system reliability here.



Grid service reliability 

The grid service reliability can be viewed as a special type of grid system
reliability if the whole grid system is regarded as providing only the required
service, with other services left out of consideration.
With this classification, the concept of grid system reliability is generalized to
include the reliability of a different number of services.

All the above algorithms for computing the grid program/system reliability are
illustrated by a numerical example below, and the reliability of the resource
management system is then also integrated into the grid reliability analysis.








Example 7.5. Suppose that a simple grid system is to provide a web service of
“Stock Analysis” for different countries. Three different resources (R1,R2,R4)
store the real-time stock prices of different countries, and another resource (R3)
is the database of a website that outputs and shows the results of the “Stock
Analysis”. The service procedure is as follows: two programs (P1
and P2) collect data from the three resources (R1,R2,R4) to analyze the stock
market information for different countries, and then output the results into the
database (R3), which can be loaded by a website service.

Revisit Fig. 7.5, which contains four virtual nodes and five virtual links,
runs the two programs and provides the four resources. Tables 7.1-7.2 show the
necessary input information.



Table 7.1. Failure rate and speed of elements (links and nodes).

Element        L(1,2)  L(1,3)  L(2,3)  L(2,4)  L(3,4)  G1     G2      G3     G4
Failure rate   0.001   0.002   0.003   0.004   0.005   0.001  0.0001  0.003  0.004
Speed (Kbps)   30      20      40      50      45











Table 7.2. Processing time and information exchanged with the resources.

Program   Run time (sec)   Resources     Exchanged information (Kbit)
P1        30               R1, R2, R3    500, 400, 300
P2        50               R3, R4        200, 600



With the approaches presented above, Table 7.3 shows all MRSTs of the
program P1 with the communication time of each element evaluated as in the
above Example 7.3 and its reliability $\Pr(E_i)$ calculated by Eq. (7.8). Table 7.3
also shows the conditional probability $p = \Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{i-1} \mid E_i)$ evaluated
similarly to the above Example 7.4.



Table 7.3. Evaluation for the grid program reliability of P1.

MRST_i   Elements                  Communication time      Pr(E_i)    p
MRST_1   G1,G2,L(1,2)              40,10,10                0.950279   -
MRST_2   G1,G2,G3,L(1,3),L(2,3)    45,7.5,22.5,15,7.5      0.847258   0.010198
MRST_3   G1,G3,G4,L(1,3),L(3,4)    45,21.7,6.7,15,6.7      0.818403   0.001000
MRST_4   G3,G4,L(3,4)              11.1,41.1,11.1          0.776313
MRST_5   G2,G3,G4,L(2,4),L(2,3)    22.5,12.5,40,10,12.5    0.755973   0.002314
MRST_6   G1,G2,G4,L(1,2),L(2,4)    16.7,26.7,40,16.7,10    0.789725   0.000810


Substituting the values of $\Pr(E_i)$ and $\Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{i-1} \mid E_i)$ of Table 7.3
into Eq. (7.10), the grid program reliability of P1 is

R(P1) = 0.99309

Similarly, the grid program reliability of P2 can be obtained as

R(P2) = 0.773368 + 0.769665 × 0.122694 + 0.90801 × 0.0907 = 0.950158

where three MRSTs are found for P2 to be executed by G2.
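
The tabulated values can be cross-checked numerically. The sketch below assumes that Eq. (7.8) is the product of exponential element reliabilities, exp(-λ T) for each element. That form is an assumption, since the equation is not restated here, but it reproduces the $\Pr(E_i)$ values printed for MRST_1 and MRST_2. The function names are hypothetical.

```python
import math

# Failure rates of the links and nodes, as read from Table 7.1.
FAILURE_RATE = {
    "L(1,2)": 0.001, "L(1,3)": 0.002, "L(2,3)": 0.003, "L(2,4)": 0.004,
    "L(3,4)": 0.005, "G1": 0.001, "G2": 0.0001, "G3": 0.003, "G4": 0.004,
}


def mrst_reliability(times):
    """Pr(E_i): every element survives its own communication time.

    Assumes independent elements with exponentially distributed failures.
    """
    exponent = sum(FAILURE_RATE[elem] * t for elem, t in times.items())
    return math.exp(-exponent)


def program_reliability(pr_e, cond_p):
    """Sum Pr(E_1) + sum over i >= 2 of Pr(E_i) * p_i, in the style of Eq. (7.10)."""
    return pr_e[0] + sum(a * b for a, b in zip(pr_e[1:], cond_p))


pr_e1 = mrst_reliability({"G1": 40, "G2": 10, "L(1,2)": 10})
pr_e2 = mrst_reliability(
    {"G1": 45, "G2": 7.5, "G3": 22.5, "L(1,3)": 15, "L(2,3)": 7.5})
r_p2 = program_reliability([0.773368, 0.769665, 0.90801], [0.122694, 0.0907])

print(round(pr_e1, 6), round(pr_e2, 6), round(r_p2, 6))
```

The conditional probabilities p are taken as given here; computing them requires the two-step algorithm of Dai et al. (2002), which this sketch does not attempt.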

The grid system reliability can then be derived. The total number of
intersected trees is 6 × 3 = 18. Similar to the grid program reliability, the grid
system reliability is obtained as

$R_s = 0.926380$

Suppose that the total time for the resource management system to deal with
the requests of program P1 is t = 15 seconds and the failure rate in that time slot is
$\lambda = 0.0005$ (sec$^{-1}$). The reliability for the request of the program P1 is then
computed as

$R_{rms}(P1) = \exp(-\lambda t) = 0.992528$

The grid program reliability of P1 considering the reliability of the resource
management system can be calculated by multiplying the above $R_{rms}(P1)$
by R(P1) as

$GPR(P1) = R_{rms}(P1) \cdot R(P1) = 0.992528 \times 0.99309 = 0.98567$

For P2, if the total time for the resource management system to deal with its
resource requests is 10 seconds, a similar approach gives

$R_{rms}(P2) = \exp(-0.0005 \times 10) = 0.99501$

Multiplying it by R(P2), we get

$GPR(P2) = R_{rms}(P2) \cdot R(P2) = 0.99501 \times 0.950158 = 0.94542$

For the grid system reliability that includes both P1 and P2, the reliability can
be computed as

$GSR = R_{rms}(P1) \cdot R_{rms}(P2) \cdot R_s = 0.992528 \times 0.99501 \times 0.926380 = 0.91487$
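
The arithmetic of this last step is easy to reproduce. The following sketch simply re-evaluates the products above; the function name `rms_reliability` is a label chosen here for illustration.

```python
import math


def rms_reliability(failure_rate, service_time):
    """Reliability of the resource management system over one request window."""
    return math.exp(-failure_rate * service_time)


# Values quoted in Example 7.5.
r_rms_p1 = rms_reliability(0.0005, 15)   # P1: 15 seconds
r_rms_p2 = rms_reliability(0.0005, 10)   # P2: 10 seconds

gpr_p1 = r_rms_p1 * 0.99309              # grid program reliability of P1
gpr_p2 = r_rms_p2 * 0.950158             # grid program reliability of P2
gsr = r_rms_p1 * r_rms_p2 * 0.926380     # grid system reliability

print(round(gpr_p1, 5), round(gpr_p2, 5), round(gsr, 5))
```

Small differences in the last printed digit relative to the text can arise because the text rounds the intermediate factors before multiplying.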



7.4. Grid Reliability of the Software and Resources 

In the above section, the grid reliability is analyzed by considering only the
network hardware failures, i.e. failures of processing nodes and communication
links. However, software program failures and resource failures should also be
integrated into the grid reliability analysis.

7.4.1. Reliability of software programs and resources

Besides hardware causes, failures of a software program may also be caused by
faults in the program itself. In the operational phase, the software program
failures can be assumed to follow the exponential distribution. The
software failure occurrence rate of program $P_i$ running on processing node
$G_j$ is denoted by $\lambda_s(i,j)$, because the same program running on different
processing nodes may have different failure rates. Also, the processing time of
$P_i$ on $G_j$ is denoted by $t(i,j)$. Thus, the reliability of the software program
$P_i$ running on $G_j$ can be simply computed by

$R_{prog}(i,j) = \exp\{-\lambda_s(i,j) \cdot t(i,j)\}$    (7.13)

For the resource reliability, the previous section assumes that if the
program uses the resource, the resource itself is perfect and failures only
occur when transferring the information through the communication network.
However, the resource itself may fail while it is needed.

Suppose the time for resource h to work is determined by the program $P_i$
by which the resource is requested and the node $G_j$ on which the resource is
integrated, denoted by $t(h,i,j)$. Also, considering the operational phase for the
integrated resources, we denote the failure rate of the resource h on the node
$G_j$ by $\lambda_r(h,j)$, with failures following the exponential distribution. Thus, the
reliability of resource h requested by $P_i$ and integrated on $G_j$ can be simply
expressed by

$R_{res}(h,i,j) = \exp\{-\lambda_r(h,j) \cdot t(h,i,j)\}$    (7.14)
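
Eqs. (7.13) and (7.14) share the same exponential form, as the minimal sketch below makes explicit. The numerical rates and times are hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
import math


def software_reliability(lambda_s, t):
    """Eq. (7.13): reliability of program P_i on node G_j.

    lambda_s is the software failure rate lambda_s(i, j) and
    t is the processing time t(i, j).
    """
    return math.exp(-lambda_s * t)


def resource_reliability(lambda_r, t):
    """Eq. (7.14): reliability of resource h requested by P_i on node G_j.

    lambda_r is the resource failure rate lambda_r(h, j) and
    t is the working time t(h, i, j).
    """
    return math.exp(-lambda_r * t)


# Hypothetical values: a program running for 30 s and a resource used for 20 s.
r_prog = software_reliability(lambda_s=0.0002, t=30)
r_res = resource_reliability(lambda_r=0.0001, t=20)
print(round(r_prog, 6), round(r_res, 6))
```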

7.4.2. Grid reliability integrating software and resource failures 

In order to integrate the software program and resource failures into the grid
reliability analysis together with the hardware network reliability, we revise the
model presented in Section 7.3. For each virtual node, consider its programs
and resources as its sub nodes, as shown in Fig. 7.7. Here $G_j$ is a virtual node
to which $P_{x1}, \ldots, P_{xm}$ are attached as the sub nodes representing programs and
$R_{y1}, \ldots, R_{yk}$ as the sub nodes corresponding to resources.




Fig. 7.7. Virtual node and its sub nodes of programs and resources. 



The abstraction of Fig. 7.7 has the following advantages:

1) The reliability of different software programs and resources can be
integrated into the grid reliability analysis given the failure rates of all
the sub nodes and their communication time.

2) It incorporates the hardware reliability in the grid reliability analysis,
and the common cause failures among those programs and resources are
considered. For example, if $G_j$ fails, all its sub nodes (corresponding
to the programs or resources executed by or integrated on the same
virtual node) are no longer working.

3) All the approaches presented in Section 7.3 can be directly
implemented to compute grid program/system/service reliability if each
sub node is viewed as an element itself, and the link between the virtual
node and its sub node is assumed to be perfect.

Example 7.6. Revisit Fig. 7.5. Replace the nodes with those in Fig. 7.7, which
consider the software program and resource failures. Fig. 7.8 depicts the new
network graph for the grid computing system containing the sub nodes of
programs and resources. The approaches presented in Section 7.3 can be
directly applied in deriving the grid reliability of Fig. 7.8.




Fig. 7.8. Grid network containing the sub nodes of programs and resources.



7.5. Notes and References 

Foster & Kesselman (1998) summarized the basic concepts of the grid and
presented a grid development tool which addresses issues of security,
information discovery, resource management, data management,
communication, and portability. It is implemented in many Grid projects.
Recently, Foster et al. (2002) further developed the grid technologies toward an
Open Grid Services Architecture in which a Grid provides an extensible set of
services that virtual organizations can aggregate in various ways.

For the resource management systems, Krauter et al. (2002) classified the
existing techniques into different types according to their control property and
investigated their applications. In order to address complex resource
management issues such as cost, Buyya et al. (2002) further proposed a
computational economy framework for resource allocation and for regulating
supply and demand in grid computing environments. This framework
provides mechanisms for optimizing resource provider and consumer objective
functions through trading and brokering services. Cao et al. (2002) also
presented an agent-based resource management system that was implemented
for grid computing. It utilized the performance prediction techniques of the
PACE toolkit to provide quantitative data regarding the performance of
complex applications running on a local grid resource.

For the network issues of the grid, Postel & Touch (1998) reviewed the
evolution of network techniques in different stages and summarized those that
could be implemented in the grid network. Keahey et al. (2002) introduced
the concept of “network services” in their “National Fusion Collaboratory”
project, which builds on top of computational grids and provides fusion
codes, together with their maintenance and hardware resources, as a service to
the community. Weissman & Lee (2002) also presented the design of a system
architecture, called Virtual Service Grid, for delivering high-performance
network services.

SYSTEM RELIABILITY



Most reliability models for computing systems assume only two possible
states of the system: an operational state and a failed state. In reality, many systems
exhibit noticeable gradations of performance between these two. For example,
if some computing elements in a computing system fail, the system may
continue working but with degraded performance. Such a degraded state is
another state between the perfect working state and the completely failed state.
To study this type of system, the multi-state system (MSS) reliability is
investigated in this chapter.

The chapter is divided into three parts. First, the basic concepts of the MSS 
are introduced. Some basic Markov models for MSS reliability analysis are then 
presented. Finally, the MSS failure correlation model is studied using a Markov 
renewal process model. 

8.1. Basic Concepts of Multi-State System (MSS)

All engineering systems are designed to perform their intended tasks in a given
environment. Some systems can perform their tasks with various distinguished
levels of efficiency, which can be referred to as performance levels. A system that
can have a finite number of performance levels is referred to as a multi-state
system, e.g. Brunelle & Kapur (1999), Pourret et al. (1999), Lisnianski & Levitin
(2003) and Wu & Chan (2003).

A binary system is the simplest case of the MSS, having only two
distinguished states. There are many different situations in which a system
should be considered to be an MSS:

1) Any system consisting of different units that have a cumulative effect on
the entire system performance can be considered as an MSS.

2) The performance level of elements composing a system can also vary as a
result of their deterioration (fatigue, partial failures) or because of
variable ambient conditions.

8.1.1. Generic MSS model 

A system element j is assumed to have $k_j$ different states of the performance
level, represented by the set

$g_j = \{g_{j1}, g_{j2}, \ldots, g_{jk_j}\}$

where $g_{ji}$ denotes the performance level of element j in the state i, $i \in \{1,2,\ldots,k_j\}$.
The performance level $G_j(t)$ of element j at any instant $t \ge 0$ is a random
variable that takes its values from $g_j$: $G_j(t) \in g_j$. Therefore, for the time interval
[0,T], where T is the MSS operation period, the performance level of element j
is defined as a stochastic process (Lisnianski & Levitin, 2003).

In some cases, the element performance cannot be measured by a single
value, but only by more complex mathematical objects, usually vectors. In these cases,
the element performance is defined as a vector stochastic process $\mathbf{G}_j(t)$.

The probabilities associated with the different states (performance levels) of
the system element j at any instant t can be represented by the set

$p_j(t) = \{p_{j1}(t), p_{j2}(t), \ldots, p_{jk_j}(t)\}$    (8.1)

where

$p_{ji}(t) = \Pr\{G_j(t) = g_{ji}\}$    (8.2)

Note that since the states of an element compose the complete group of mutually
exclusive events, we have

$\sum_{i=1}^{k_j} p_{ji}(t) = 1, \quad 0 \le t \le T$

Eq. (8.2) defines the probability mass function of $G_j(t)$ for discrete performance levels at
any instant t. The collection of pairs $g_{ji}$, $p_{ji}(t)$, $i = 1,2,\ldots,k_j$, completely
determines the probability distribution of performance of the element j at any
instant t, see, e.g., Lisnianski & Levitin (2003).

When the MSS consists of n elements, its performance levels are
unambiguously determined by the performance levels of these elements. At any
time, the system elements have certain performance levels corresponding to their
states. Assume that the system has K different states and that $g_i$ is the entire
system performance level in state i, $i \in \{1,\ldots,K\}$. The MSS performance level at
time t is a random variable that takes values from the set $\{g_1,\ldots,g_K\}$.

Let

$L^n = \{g_{11},\ldots,g_{1k_1}\} \times \{g_{21},\ldots,g_{2k_2}\} \times \cdots \times \{g_{n1},\ldots,g_{nk_n}\}$

be the space of possible combinations of performance levels for all of the system
elements and $M = \{g_1,\ldots,g_K\}$ be the space of possible values of the performance
level for the entire system. The transform

$\phi(G_1(t),\ldots,G_n(t)): L^n \to M$

which maps the space of the elements’ performance levels into the space of
system performance levels, is called the system structure function. Note that the
MSS structure function is an extension of a binary structure function. The only
difference is in the definition of the state spaces: the binary structure function is
mapped as $\{0,1\}^n \to \{0,1\}$, while in the MSS, one deals with more complex
spaces.

A generic model of the multi-state system can be defined as follows. The
performance processes are modelled as stochastic processes $G_j(t)$, $j = 1,2,\ldots,n$, for
each system element j. The system structure function that produces the stochastic
process corresponding to the output performance of the entire MSS is

$G(t) = \phi(G_1(t), G_2(t), \ldots, G_n(t))$

In practice, a simpler MSS model can be used. This can be based on the
probability distribution of performances for all of the system elements at any
instant t during the operation period [0,T] and the system structure function:

$g_j, \ p_j(t), \quad 1 \le j \le n$    (8.3)

$G(t) = \phi(G_1(t), G_2(t), \ldots, G_n(t))$    (8.4)

The system structure function can also be represented in a table, in analytical form, or be
described as an algorithm for unambiguously determining the system
performance G(t) for any given set $\{G_1(t), G_2(t), \ldots, G_n(t)\}$. An example of MSS
modeling is illustrated below.



Example 8.1. Consider a 2-out-of-3 MSS. This system consists of 3 binary
elements with the performance levels $G_j(t) \in \{g_{j1}, g_{j2}\} = \{0,1\}$, for $j = 1,2,3$, where

$G_j(t) = \begin{cases} 0, & \text{if the element } j \text{ is in the state of complete failure;} \\ 1, & \text{if the element } j \text{ functions perfectly.} \end{cases}$

The system output performance level G(t) at any instant t is:

$G(t) = \begin{cases} 0, & \text{if there is more than one failed element;} \\ 1, & \text{if there is only one failed element;} \\ 2, & \text{if all the elements function perfectly.} \end{cases}$

The values of the system structure function $G(t) = \phi(G_1(t), G_2(t), G_3(t))$ for all
the possible system states are presented in Table 8.1.



Table 8.1: Structure function for 2-out-of-3 system

G_1(t)   G_2(t)   G_3(t)   phi(G_1(t), G_2(t), G_3(t))
0        0        0        0
0        0        1        0
0        1        0        0
0        1        1        1
1        0        0        0
1        0        1        1
1        1        0        1
1        1        1        2
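
The structure function of Example 8.1 is simple enough to express as an algorithm. The sketch below enumerates all element states to rebuild Table 8.1. It also derives the distribution of the output performance under the added assumption, not made in the text, that the three elements work independently with a common probability p. All names are hypothetical.

```python
from itertools import product


def structure_function(g1, g2, g3):
    """phi for the 2-out-of-3 MSS; each argument is 0 (failed) or 1 (working)."""
    failed = 3 - (g1 + g2 + g3)
    if failed > 1:
        return 0
    if failed == 1:
        return 1
    return 2


# Rebuild Table 8.1 by enumerating all eight element-state combinations.
table = {state: structure_function(*state) for state in product((0, 1), repeat=3)}


def output_distribution(p):
    """Distribution of G(t), assuming independent elements that work with probability p."""
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for state in product((0, 1), repeat=3):
        prob = 1.0
        for g in state:
            prob *= p if g == 1 else 1.0 - p
        dist[structure_function(*state)] += prob
    return dist


for state, level in sorted(table.items()):
    print(state, level)
print(output_distribution(0.9))
```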



8.1.2. Basic MSS reliability measures 

To characterize MSS behavior from a reliability point of view one has to 
determine the MSS reliability measures. These measures can be considered as 
extensions of the corresponding reliability measures for a binary-state system. 
Brunelle & Kapur (1999) and Lisnianski & Levitin (2003) reviewed many MSS 
reliability measures. Some commonly used ones are introduced as follows. 

Since the system is characterized by its output performance G(t), the state
acceptability depends on the value of this index. This dependency can be
expressed by the acceptability function F(G(t)) that takes non-negative values if
and only if the MSS functioning is acceptable. This takes place when the
efficiency of the system functioning is completely determined by its internal state
(Lisnianski & Levitin, 2003). In such cases, a particular set of MSS states is of
interest to the customer. Usually these states are interpreted as system failure
states, which when reached, imply that the system should be repaired or
discarded. The set of acceptable states can also be defined when the system
functionality level is of interest at a particular point in time (such as at the end of
the warranty period).

More frequently, the system state acceptability depends on the relation
between the MSS performance and the desired level of this performance
(demand). In general, the demand W(t) is also a random process. It can take
discrete values from the set $w = \{w_1,\ldots,w_M\}$. The desired relation between the
system performance and the demand can also be expressed by the acceptability
function F(G(t),W(t)). The acceptable system states correspond to
$F(G(t),W(t)) \ge 0$ and the unacceptable states correspond to $F(G(t),W(t)) < 0$.
The last inequality defines the MSS failure criterion.

Often the performance of the MSS should exceed the demand. In such cases the
acceptability function takes the form

F(G(t),W(t)) = G(t)-W(t) 

Since G(t) and W(t) are random processes, the subset of acceptable states 
can vary in time. The system behavior during the operation period can be 
characterized by the possibility of entering the subset of unacceptable states more 
than once. The case when MSS can enter this subset only once corresponds to 
non-repairable systems. For repairable systems or for the systems with variable 
demands, the transitions between subsets of acceptable and unacceptable states 
may occur an arbitrary number of times. 

Some other reliability measures are based on the above acceptability
function F(G(t),W(t)). The following random variables can be of interest:

(a) Time to failure, $T_f$, is the time from the beginning of the system life up
to the instant when the system enters the subset of unacceptable states for the
first time.

(b) Time between failures, $T_b$, is the time between two consecutive
transitions from the subset of acceptable states to the subset of
unacceptable states.

(c) Number of failures, $N_T$, is the number of times the system enters the
subset of unacceptable states during the time interval [0,T].

The probability of a failure-free operation or reliability function is

$R(t) = \Pr\{T_f \ge t \mid F(G(0),W(0)) \ge 0\}$

The Mean Time To Failure (MTTF) is the expected time up to the instant
when the system enters the subset of unacceptable states for the first time,
i.e. $E(T_f)$.

The MSS instantaneous (point) availability A(t) is the probability that the
MSS at instant t is in one of the acceptable states:

$A(t) = \Pr\{F(G(t),W(t)) \ge 0\}$

The MSS availability in the time interval [0,T] is defined as:

$A_T = \frac{1}{T} \int_0^T A(t)\,dt$    (8.5)

which represents the portion of time when the MSS output performance level is
in an acceptable area.
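
When A(t) is available only numerically, Eq. (8.5) can be approximated by quadrature. The sketch below uses the composite trapezoidal rule. The test function is the well-known point availability of a two-state repairable element that starts in the working state; both that function and the rates used are assumptions introduced only to exercise the code.

```python
import math


def mean_availability(availability, horizon, steps=10_000):
    """Approximate Eq. (8.5), (1/T) * integral of A(t) over [0, T], by the trapezoidal rule."""
    h = horizon / steps
    total = 0.5 * (availability(0.0) + availability(horizon))
    total += sum(availability(k * h) for k in range(1, steps))
    return total * h / horizon


# Hypothetical two-state element: failure rate lam, repair rate mu.
lam, mu = 0.02, 1.0


def a_two_state(t):
    """Point availability of a two-state element that is working at t = 0."""
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)


a_mean = mean_availability(a_two_state, horizon=100.0)
print(round(a_mean, 6))
```

Over a long horizon the result approaches the steady-state availability mu / (lam + mu), with a small positive correction from the initial transient.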

Wu & Chan (2003) further presented an MSS measure called expected utility 
function to evaluate the overall performance of the MSS at a time instant t, 
expressed by 

7=0 

where $a_j$ is the utility of the MSS when it stays in state j.

8.2. Basic Models for MSS Reliability

According to the generic MSS model, any system element j can have $k_j$
different states corresponding to the performance levels, represented by the set
$g_j = \{g_{j1},\ldots,g_{jk_j}\}$. The current state of the element j and, therefore, the current
value of the element performance level $G_j(t)$ at any instant t, are random
variables. $G_j(t)$ takes values from $g_j$: $G_j(t) \in g_j$. Therefore, for the time
interval [0,T], where T is the MSS operation period, the performance level of
element j is defined as a stochastic process.

In this section, when we deal with a single multi-state element, the index j
will be omitted from the designation of the set of the element’s performance levels.
This set is denoted as $g = \{g_1,\ldots,g_k\}$. We also assume that this set is ordered so
that $g_{i+1} > g_i$ for any i.



8.2.1. Non-repairable multi-state elements 

The lifetime of a non-repairable element lasts until its first entrance into the
subset of unacceptable states. In general, the acceptability of the state of an
element depends on the relation between the performance of the element and the
desired level of this performance (demand). The demand W(t) is also a random
process that takes discrete values from the set $w = \{w_1,\ldots,w_M\}$. The desired
relation between the system performance and the demand can be expressed by
the acceptability function F(G(t),W(t)).



Minor failures 

First consider a multi-state element with only minor failures, defined as failures
that cause element transition from state i to the adjacent state i-1. In other words,
a minor failure causes minimal degradation of element performance. The CTMC
for such an element is presented in Fig. 8.1.

Fig. 8.1. CTMC for non-repairable element with minor failures.



Denote by $P_i(t)$ ($i = 1,2,\ldots,k$) the probability for the system to stay at state i at time
instant t. Then the Chapman-Kolmogorov equations can be written as

$P_k'(t) = -\lambda_{k,k-1} P_k(t)$

$P_i'(t) = \lambda_{i+1,i} P_{i+1}(t) - \lambda_{i,i-1} P_i(t), \quad i = 2,3,\ldots,k-1$    (8.6)

$P_1'(t) = \lambda_{2,1} P_2(t)$

We assume that the process starts from the best state k with a maximal
element performance level of $g_k$. Hence, the initial conditions are

$P_k(0) = 1$

and

$P_{k-1}(0) = P_{k-2}(0) = \cdots = P_1(0) = 0$



One can obtain the numerical solution of the above differential equations under
the initial conditions even for large k. They can also be solved analytically by
using the Laplace-Stieltjes transform in some cases. With this transform and by
taking into account the initial conditions, one can represent the above differential
equations in the form of linear algebraic equations, which are solved to obtain

$P_k(s) = \frac{1}{s + \lambda_{k,k-1}}$

$P_i(s) = \frac{1}{s + \lambda_{k,k-1}} \cdot \frac{\lambda_{k,k-1}}{s + \lambda_{k-1,k-2}} \cdots \frac{\lambda_{i+1,i}}{s + \lambda_{i,i-1}}, \quad i = 2,3,\ldots,k-1$    (8.7)

$P_1(s) = \frac{1}{s + \lambda_{k,k-1}} \cdot \frac{\lambda_{k,k-1}}{s + \lambda_{k-1,k-2}} \cdots \frac{\lambda_{3,2}}{s + \lambda_{2,1}} \cdot \frac{\lambda_{2,1}}{s}$

Now, in order to find the functions $P_i(t)$, the inverse Laplace-Stieltjes transform
can be applied.

The probability of the state with the lowest performance, $P_1(t)$, determines
the unreliability function of the multi-state element for the constant demand level
$g_2 \ge w > g_1$. The reliability function, defined as the probability that the element is
not in its worst state (total failure), is

$R_1(t) = 1 - P_1(t)$    (8.8)

In general, if the constant demand is $g_{i+1} \ge w > g_i$ ($i = 1,\ldots,k-1$), the
unreliability function for the multi-state element is a sum of the probabilities of
the unacceptable states 1,...,i. The reliability function is then

$R_i(t) = 1 - \sum_{j=1}^{i} P_j(t)$    (8.9)

The mean time up to multi-state element failure for this constant demand level
can be interpreted as the mean time until the process enters state i. It can be calculated
as the sum of the mean time periods during which the process remains in each state
j > i. Since the process begins from the best state k with the maximal element
performance level, we have

$MTTF_i = \sum_{j=i+1}^{k} \frac{1}{\lambda_{j,j-1}}$    (8.10)
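
The mapping from a demand level to the index i, and the use of Eqs. (8.9) and (8.10), can be sketched as follows. The three-state element and its rates are hypothetical, and a state is treated as acceptable when its performance level is at least the demand, which is an assumption consistent with F = G - W being non-negative.

```python
def unacceptable_count(levels, demand):
    """Return i, the number of states whose performance level is below the demand.

    levels[0] is g_1 (worst) and levels[-1] is g_k (best), in increasing order.
    """
    return sum(1 for g in levels if g < demand)


def reliability(state_probs, i):
    """Eq. (8.9): one minus the probability of the unacceptable states 1..i."""
    return 1.0 - sum(state_probs[:i])


def mttf(minor_rates, i):
    """Eq. (8.10): minor_rates[j] holds lambda_{j,j-1} for j = 2..k."""
    k = max(minor_rates)
    if not 1 <= i < k:
        raise ValueError("demand must lie between the worst and the best level")
    return sum(1.0 / minor_rates[j] for j in range(i + 1, k + 1))


# Hypothetical three-state element (values chosen for illustration only).
levels = [0, 40, 100]                 # g_1, g_2, g_3
minor_rates = {3: 0.05, 2: 0.01}      # lambda_{3,2}, lambda_{2,1}

i_low = unacceptable_count(levels, demand=30)    # only state 1 is unacceptable
i_high = unacceptable_count(levels, demand=60)   # states 1 and 2 are unacceptable
mttf_low = mttf(minor_rates, i_low)              # 1/0.05 + 1/0.01
mttf_high = mttf(minor_rates, i_high)            # 1/0.05
print(i_low, mttf_low, i_high, mttf_high)
```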



Example 8.2. Consider a non-repairable multi-state system that has only minor
failures. The system has 4 possible states whose performance levels are set as
100, 80, 50 and 0, respectively. Its Markov model can be built as in Fig. 8.1 with
k = 4. Assume that the failure rates are given by

$\lambda_{4,3} = 0.02, \quad \lambda_{3,2} = 0.01, \quad \lambda_{2,1} = 0.007$

and the initial state is the best state, state 4.

Substituting the above numerical values into the Laplace-Stieltjes transforms
and inverting them, the state probabilities can be obtained. The state probabilities
as a function of time are shown in Fig. 8.2.

Fig. 8.2. State probabilities for non-repairable 4-state system with
minor failures.
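
As an alternative to inverting the transforms, the state probabilities of Example 8.2 can be obtained by integrating the differential equations numerically. The sketch below uses a classical fourth-order Runge-Kutta scheme written in plain Python; the choice of integrator, step count and function names are assumptions made for illustration.

```python
def derivatives(p, lam43=0.02, lam32=0.01, lam21=0.007):
    """Right-hand side of the Chapman-Kolmogorov equations for k = 4.

    p is the tuple (P1, P2, P3, P4); only minor failures are modelled.
    """
    p1, p2, p3, p4 = p
    return (lam21 * p2,
            lam32 * p3 - lam21 * p2,
            lam43 * p4 - lam32 * p3,
            -lam43 * p4)


def rk4_step(p, h):
    """Advance the state probabilities by one Runge-Kutta step of size h."""
    k1 = derivatives(p)
    k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)))
    k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)))
    k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)))
    return tuple(x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for x, a, b, c, d in zip(p, k1, k2, k3, k4))


def state_probabilities(t_end, steps=1000):
    """Return (P1, P2, P3, P4) at time t_end, starting from state 4."""
    p = (0.0, 0.0, 0.0, 1.0)
    h = t_end / steps
    for _ in range(steps):
        p = rk4_step(p, h)
    return p


probs_100 = state_probabilities(100.0)
print([round(x, 6) for x in probs_100])
```

At t = 100 the probability of state 4 should agree with exp(-0.02 × 100), which gives a quick sanity check on the integration.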



Assume that the constant demand is w = 75. Therefore, the system is
reliable only if the system is at least at state 3 with performance level 80. Then,
the reliability function can be obtained as

$R_2(t) = 1 - P_1(t) - P_2(t)$

Then, the mean time to failure is obtained by

$MTTF_2 = \frac{1}{\lambda_{4,3}} + \frac{1}{\lambda_{3,2}} = 150$

Both minor and major failures

Now consider a non-repairable multi-state element that can have both minor and
major failures (a major failure is a failure that causes the element transition from
state i to state j with j < i-1). The state-space diagram for such an element,
representing transitions corresponding to both minor and major failures, is
presented in Fig. 8.3.

Fig. 8.3. CTMC for non-repairable element with minor and major failures.



For this Markov model, the Chapman-Kolmogorov equations can be written
as

$P_k'(t) = -P_k(t) \sum_{j=1}^{k-1} \lambda_{k,j}$

$P_i'(t) = \sum_{j=i+1}^{k} \lambda_{j,i} P_j(t) - P_i(t) \sum_{j=1}^{i-1} \lambda_{i,j}, \quad i = 2,3,\ldots,k-1$    (8.11)

$P_1'(t) = \sum_{j=2}^{k} \lambda_{j,1} P_j(t)$

After solving the above equations and obtaining the state probabilities $P_i(t)$, the
reliability can be easily derived as in Eqs. (8.8)-(8.9).

8.2.2. Repairable multi-state elements

Availability modeling

A more general model of a multi-state element is the model with repair. The
repairs can also be both minor and major. The minor repair returns an element
from state j to state j+1 while the major repair returns it from state j to state i,
where i > j+1, see, e.g., Lisnianski & Levitin (2003).

A special case is when an element has only minor failures and minor repairs. 
It is actually a birth and death process. The CTMC of this process is presented in 
Fig. 8.4. The CTMC for the general case of the repairable multi-state element 
with minor and major failures and repairs is presented in Fig. 8.5. 





Fig. 8.4. CTMC for repairable element with minor failures and minor repairs.




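Fig. 8.5. CTMC for general repairable MSS element.

The following are the Chapman-Kolmogorov equations for the general case:

$P_k'(t) = \sum_{j=1}^{k-1} \mu_{j,k} P_j(t) - P_k(t) \sum_{j=1}^{k-1} \lambda_{k,j}$

$P_i'(t) = \sum_{j=i+1}^{k} \lambda_{j,i} P_j(t) + \sum_{j=1}^{i-1} \mu_{j,i} P_j(t) - P_i(t) \left( \sum_{j=1}^{i-1} \lambda_{i,j} + \sum_{j=i+1}^{k} \mu_{i,j} \right), \quad i = 2,3,\ldots,k-1$    (8.12)

$P_1'(t) = \sum_{j=2}^{k} \lambda_{j,1} P_j(t) - P_1(t) \sum_{j=2}^{k} \mu_{1,j}$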



Solving the above equations, one obtains the state probabilities $P_i(t)$ ($i = 1,2,\ldots,k$).

When $F(g_i, w) = g_i - w$ for the constant demand level $g_{i+1} \ge w > g_i$, the
acceptable states where the element performance is above level $g_i$ are i+1, ...,
k. Hence, the availability function is

$A_i(t) = \sum_{j=i+1}^{k} P_j(t)$    (8.13)

In many applications, the long-run or final state probabilities $\lim_{t \to \infty} P_i(t)$ are
of interest for the repairable element.



For the long-run state probabilities, the computations become simpler. The
above differential equations are reduced to a set of k linear algebraic equations
because, for constant probabilities, all time derivatives $P_i'(t) = 0$, as below:

$\sum_{j=1}^{k-1} \mu_{j,k} P_j(t) - P_k(t) \sum_{j=1}^{k-1} \lambda_{k,j} = 0$

$\sum_{j=i+1}^{k} \lambda_{j,i} P_j(t) + \sum_{j=1}^{i-1} \mu_{j,i} P_j(t) - P_i(t) \left( \sum_{j=1}^{i-1} \lambda_{i,j} + \sum_{j=i+1}^{k} \mu_{i,j} \right) = 0, \quad i = 2,3,\ldots,k-1$    (8.14)

An additional independent equation can be provided by the simple fact that the
sum of the state probabilities is equal to 1 at any time. The above equations can
then be solved.
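
A minimal sketch of this long-run computation is given below. It writes one balance equation per state, replaces one of them by the normalisation condition, and solves the resulting linear system by Gauss-Jordan elimination in plain Python. The rate dictionaries are hypothetical test data and the function name is an assumption.

```python
def steady_state(k, rates):
    """Long-run state probabilities of a k-state element.

    rates maps (i, j) to the transition rate from state i to state j
    (1-based), covering both failures (i > j) and repairs (i < j).
    """
    a = [[0.0] * k for _ in range(k)]
    for (i, j), r in rates.items():
        a[j - 1][i - 1] += r   # inflow into state j from state i
        a[i - 1][i - 1] -= r   # outflow from state i
    b = [0.0] * k
    # Replace the last balance equation by the normalisation condition.
    a[k - 1] = [1.0] * k
    b[k - 1] = 1.0
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [x - f * y for x, y in zip(a[row], a[col])]
                b[row] -= f * b[col]
    return [b[i] / a[i][i] for i in range(k)]


# Two-state check: failure rate 0.02 (2 -> 1), repair rate 1.0 (1 -> 2).
p2 = steady_state(2, {(2, 1): 0.02, (1, 2): 1.0})

# Hypothetical three-state element with minor and major failures and repairs.
rates3 = {(3, 2): 0.02, (2, 1): 0.01, (3, 1): 0.005,
          (1, 2): 0.5, (2, 3): 0.8, (1, 3): 0.3}
p3 = steady_state(3, rates3)
availability = sum(p3[1:])   # long-run availability when states 2 and 3 are acceptable
print(p2, p3, availability)
```

Dropping one balance equation is legitimate because the k balance equations are linearly dependent; any one of them is implied by the others.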



Example 8.3. Consider a 4-state repairable system with both minor and major
failures and repairs. The performance levels of the four states are

$g_4 = 100, \quad g_3 = 80, \quad g_2 = 50, \quad g_1 = 0$

respectively. The unit has the following failure rates:

$\lambda_{4,3} = 0.02, \quad \lambda_{3,2} = 0.006, \quad \lambda_{2,1} = 0.007, \quad \lambda_{3,1} = 0.002, \quad \lambda_{4,2} = 0.005, \quad \lambda_{4,1} = 0.003$

and the following repair rates: 

Mi, * = 1 . Mi,i = 0 8 , Mi.i = 0 5 , Mi.i = 0.32 , ju u = 0.4 , ju 2 i = 0.45 

The Markov model can be constructed as Fig. 8.5 with k= 4. Substituting the 
above numerical values into Eq. (8.12), we can obtain the state probability 
functions as depicted by Fig. 8.6. 




Fig. 8.6. State probabilities of the repairable 4-state system.

Assume that the constant demand is w = 75. The acceptable states of the system
are states 4 and 3, so the system availability function is

$A(t) = P_4(t) + P_3(t)$

which is also shown in Fig. 8.6 as the dashed line.



Reliability modeling 

The determination of the reliability function for the repairable multi-state
element is based on finding the probability of the event when the element enters
the set of unacceptable states for the first time. In order to find the element
reliability function $R_i(t)$ for the constant demand w ($g_{i+1} \ge w > g_i$), an
additional Markov model should be built. All states 1,2,...,i of the element
corresponding to the performance levels lower than the demand w should be
combined in one absorbing state. This absorbing state can be considered now as
state 0, and all repairs that return the element from this state back to the set of
acceptable states should be forbidden.

The transition rate A m 0 from any acceptable state m (, m > i) to the combined 
absorbing state 0 is equal to the sum of the transition rates from the state m to all 
the unacceptable states (states 1,2,...,/): 

i 

4,o ’ f° r m=i+l,...,k (8.15) 

The CTMC model for the computation of the reliability function is depicted in
Fig. 8.7.

For this CTMC, the state probability $P_0(t)$ characterizes the unreliability
function of the element, because after the first entrance into the absorbing
state 0 the element never leaves it, i.e., we have

$R_i(t) = 1 - P_0(t).$

Fig. 8.7. CTMC for determination of the reliability function for a repairable
element.



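$P_0(t)$ is obtained by solving the following Chapman-Kolmogorov equations:

$$P_k'(t) = \sum_{j=i+1}^{k-1} \mu_{j,k} P_j(t) - P_k(t) \left( \sum_{j=i+1}^{k-1} \lambda_{k,j} + \lambda_{k,0} \right)$$

$$P_m'(t) = \sum_{j=m+1}^{k} \lambda_{j,m} P_j(t) + \sum_{j=i+1}^{m-1} \mu_{j,m} P_j(t) - P_m(t) \left( \sum_{j=i+1}^{m-1} \lambda_{m,j} + \lambda_{m,0} + \sum_{j=m+1}^{k} \mu_{m,j} \right), \quad i < m < k$$

$$P_0'(t) = \sum_{j=i+1}^{k} \lambda_{j,0} P_j(t) \qquad (8.16)$$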



The reliability function can then be obtained. 



Example 8.4. Continue with Example 8.3. The system is regarded as failed once
its performance level drops below the demand w = 75, i.e., once it leaves
states 4 and 3 for the first time; the reliability is the probability that this
has not yet happened by time t. The Markov model can then be constructed as in
Fig. 8.8.

Fig. 8.8. CTMC for the reliability function of Example 8.4.

By substituting the numerical values given in Example 8.3 into the above
equations and solving them, we obtain the state probability functions as

$P_4(t) = 0.9804 \exp(-0.008t) + 0.0196 \exp(-1.028t)$

$P_3(t) = 0.0196 \exp(-0.008t) - 0.0196 \exp(-1.028t)$

$P_0(t) = 1 - \exp(-0.008t)$

Then, the reliability function is given by

$R(t) = P_4(t) + P_3(t) = \exp(-0.008t)$
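The closed-form result of Example 8.4 can be cross-checked numerically. The sketch below integrates the two transient-state equations by eigen-decomposition. It assumes that the system starts in state 4 and that the aggregated rates into the absorbing state are 0.008 from both state 4 and state 3; these two aggregate values are inferred here from the exponents of the closed-form solution, and the helper names are illustrative.

```python
import numpy as np

lam_43, mu_34 = 0.02, 1.0      # minor failure 4 -> 3 and repair 3 -> 4
lam_40 = lam_30 = 0.008        # ASSUMED aggregate rates into absorbing state 0

# dP/dt = A @ P for the transient states, ordered as (P4, P3).
A = np.array([
    [-(lam_43 + lam_40), mu_34],
    [lam_43, -(mu_34 + lam_30)],
])

def transient_probs(t, p0=(1.0, 0.0)):
    """Return (P4(t), P3(t)) using the eigen-decomposition of A."""
    w, V = np.linalg.eig(A)
    c = np.linalg.solve(V, np.asarray(p0, dtype=float))
    return (V * np.exp(w * t)) @ c

def reliability(t):
    """R(t) = P4(t) + P3(t) = 1 - P0(t)."""
    return float(np.sum(transient_probs(t)))
```

Under these assumptions, `reliability(t)` agrees with $\exp(-0.008t)$ up to floating-point error.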



8.3. A MSS Failure Correlation Model 

Most MSS reliability models assume that successive system runs are independent,
an assumption that often does not hold in practice. This section presents an MSS
reliability model based on Markov renewal processes that captures the dependence
among successive runs.

8.3.1. Modeling MSS correlated failures

Apart from the perfect working state, the other states of an MSS can be viewed
as different types of failure states. If failures can be of n different types,
the MSS has n+1 possible states in total, one of which is the perfect state.

For a correlated MSS with n types of failures and one successful state, a
general Markov process can be constructed as follows:

1) Build an (n+1)-state discrete-time Markov chain (DTMC) with transition
probability matrix

$$P = \begin{bmatrix} p_{00} & p_{01} & \cdots & p_{0n} \\ p_{10} & p_{11} & \cdots & p_{1n} \\ \vdots & \vdots & & \vdots \\ p_{n0} & p_{n1} & \cdots & p_{nn} \end{bmatrix}$$

2) To overcome the discrete-time property, introduce a process in
continuous time by letting the time spent in a transition from state k to
state l have Cdf $F_{kl}(t)$.

Such a process is a semi-Markov process.



Model for two failure states 

When there are two failure states, the MSS can be in one of three states after a
run: a successful state, a Type A failure state and a Type B failure state. A
Type A failure is a serious failure, such as a catastrophic or critical failure.
A Type B failure is less serious, such as a minor or marginal failure.

A common situation is that the system cannot continue to perform its function
when a Type A failure occurs, whereas after a Type B failure the system can
still work, although it is then more likely to suffer a Type A failure in the
next run. The result of a run thus affects the probable state in the next run,
as shown in Fig. 8.9. Here we consider the case in which there is no debugging,
only resetting or restarting when a Type A failure occurs. Under this assumption
the transition probabilities remain unchanged.

Let $Z_k$ be the random variable denoting the state after the k-th run, and
denote

$p_{mj} = P\{Z_{k+1} = j \mid Z_k = m\}, \quad m, j = 0, 1, 2$




State 0: Successful state after a run 
State 1 : Type A failure occurs after a run 
State 2: Type B failure occurs after a run 

Fig. 8.9. Markov interpretation of dependent runs. 



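The transition matrix is

$$P = \begin{bmatrix} p_{00} & p_{01} & p_{02} \\ p_{10} & p_{11} & p_{12} \\ p_{20} & p_{21} & p_{22} \end{bmatrix} \qquad (8.17)$$

in which

$\sum_{j=0}^{2} p_{mj} = 1, \quad m = 0, 1, 2 \qquad (8.18)$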







The unconditional probability of failure on run (i+1) is:

$P\{Z_{i+1} = j\} = \sum_{m=0}^{2} p_{mj} P\{Z_i = m\}, \quad j = 0, 1, 2 \qquad (8.19)$

Since $P\{Z_i = 0\} = 1 - P\{Z_i = 1\} - P\{Z_i = 2\}$, the above equation can be
written as

$P\{Z_{i+1} = j\} = (p_{2j} - p_{0j}) P\{Z_i = 2\} + (p_{1j} - p_{0j}) P\{Z_i = 1\} + p_{0j}, \quad j = 0, 1, 2 \qquad (8.20)$
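The recursion in Eqs. (8.19) and (8.20) can be sketched in a few lines of Python. The transition matrix below is hypothetical (its entries are illustrative only), and the two functions simply evaluate the two equivalent forms of the update so that they can be compared.

```python
import numpy as np

# Hypothetical transition matrix; row m holds p_m0, p_m1, p_m2.
P = np.array([
    [0.90, 0.02, 0.08],
    [0.85, 0.05, 0.10],
    [0.70, 0.10, 0.20],
])

def next_distribution(dist):
    """Eq. (8.19): state distribution after the next run."""
    return dist @ P

def next_distribution_820(dist):
    """Eq. (8.20): the same update with P{Z_i = 0} eliminated."""
    return (P[2] - P[0]) * dist[2] + (P[1] - P[0]) * dist[1] + P[0]

dist = np.array([1.0, 0.0, 0.0])   # start in the successful state
for _ in range(5):                 # distribution after five runs
    dist = next_distribution(dist)
```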



The next step is to develop a model in continuous time that accounts for the
time the system spends running. Let $F_{kl}(t)$ be the Cdf of the time spent in
a transition from state k to state l of the DTMC in Fig. 8.9. Here, $F_{kl}(t)$
is assumed to depend only on the state at the end of each interval in a system
run (see e.g. Goseva-Popstojanova & Trivedi, 2000), i.e.,

$F_{0j}(t) = F_{1j}(t) = F_{2j}(t) = F_{\cdot j}(t), \quad j = 0, 1, 2$

With the addition of the $F_{\cdot j}(t)$ to the transitions of the discrete-time
Markov chain, we obtain a semi-Markov process as the system reliability model in
continuous time.



Model for two failure states with debugging 

Furthermore, we assume that after a Type A failure the system may be debugged,
and that fault removal is instantaneous. After the fault is removed, the
transition probability matrix changes. When successive runs are successful or
cause only Type B failures, the system does not have to be debugged and
continues running in the same way. The transition probability matrix can
therefore be assumed to remain unchanged until a Type A failure happens.

The Markov renewal model is modified as shown in Fig. 8.10.

Fig. 8.10. Nonhomogeneous DTMC for system reliability model.



Here i is the number of Type A failures that have already been detected and
removed. During the testing phase, the system is subjected to a sequence of
runs, and no changes are made as long as no Type A failure occurs. When a Type A
failure occurs on a run, an attempt is made to fix the underlying fault, which
changes the conditional probabilities of the state on the next run. The
transition probability matrix for the period from the occurrence of the i-th
Type A failure to the occurrence of the (i+1)-st Type A failure is




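$$P_i = \begin{bmatrix} p_{00}^{(i)} & p_{01}^{(i)} & p_{02}^{(i)} \\ p_{10}^{(i)} & p_{11}^{(i)} & p_{12}^{(i)} \\ p_{20}^{(i)} & p_{21}^{(i)} & p_{22}^{(i)} \end{bmatrix} \qquad (8.21)$$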



Let $S_m$ be the total number of Type A failures after m runs. The sequence
$\{S_m\}$ provides an alternative description of the system reliability model
with the debugging process considered here, and it defines the DTMC shown in
Fig. 8.10. The states $i$, $i_0$ and $i_2$ all indicate that the Type A failure
state has been occupied i times. State $i$ represents the initial state for
which $S_m = i$. State $i_0$ represents all subsequent successful trials for
which $S_m = i$. State $i_2$ represents all subsequent Type B failure trials for
which $S_m = i$.

General model for n failure states

The above models can be extended to the general case of multiple failure states.
Assume that the failures can be divided into n types, so that the MSS contains
n+1 states in total, including the perfect state. Again denote the critical
failure type as the Type A failure state. When this type of failure occurs, the
system stops working completely and action has to be taken. First we assume that
no changes are made to the system other than resetting and restarting when a
Type A failure occurs, so the transition probability matrix for successive runs
remains unchanged. The Markov process is shown in Fig. 8.11.




State 0: Successful state after a run 

State 1 : Type A failure occurs after a run 

States 2 to n : The other n - 1 types of failures occur after a run 

Fig. 8.11. Markov interpretation for n-type correlated state transition. 



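Denote

$p_{mj} = P\{Z_{k+1} = j \mid Z_k = m\}, \quad m, j = 0, 1, 2, \ldots, n$

and the transition probability matrix is then

$$P = \begin{bmatrix} p_{00} & p_{01} & p_{02} & \cdots & p_{0n} \\ p_{10} & p_{11} & p_{12} & \cdots & p_{1n} \\ p_{20} & p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ p_{n0} & p_{n1} & p_{n2} & \cdots & p_{nn} \end{bmatrix}$$

and the transition probabilities should satisfy

$\sum_{j=0}^{n} p_{mj} = 1, \quad m = 0, 1, 2, \ldots, n \qquad (8.22)$

The unconditional probability of failure on run (i+1) is:

$P\{Z_{i+1} = j\} = \sum_{m=0}^{n} p_{mj} P\{Z_i = m\}, \quad j = 0, 1, 2, \ldots, n \qquad (8.23)$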



As in the previous case of two types of failures, when debugging takes place
after a Type A failure, the transition probability matrix changes accordingly,
and a Markov renewal model with n types of failure states can be constructed.

Let i be the number of Type A failures that have already been detected and
removed. The transition matrix for the period from the occurrence of the i-th
Type A failure to the occurrence of the (i+1)-st Type A failure is given as
follows:

$$P_i = \begin{bmatrix} p_{00}^{(i)} & p_{01}^{(i)} & p_{02}^{(i)} & \cdots & p_{0n}^{(i)} \\ p_{10}^{(i)} & p_{11}^{(i)} & p_{12}^{(i)} & \cdots & p_{1n}^{(i)} \\ p_{20}^{(i)} & p_{21}^{(i)} & p_{22}^{(i)} & \cdots & p_{2n}^{(i)} \\ \vdots & \vdots & \vdots & & \vdots \\ p_{n0}^{(i)} & p_{n1}^{(i)} & p_{n2}^{(i)} & \cdots & p_{nn}^{(i)} \end{bmatrix}$$

and the transition probabilities should satisfy

$\sum_{j=0}^{n} p_{mj}^{(i)} = 1, \quad m = 0, 1, 2, \ldots, n \qquad (8.24)$

Again $\{S_m\}$ defines the DTMC. The states $i, i_0, i_2, \ldots, i_n$ all
indicate that the Type A failure state has been occupied i times. State $i$
represents the first trial for which $S_m = i$. State $i_0$ represents all
subsequent successful trials for which $S_m = i$. States $i_2$ to $i_n$
represent the subsequent trials ending in failure states of Types 2 to n for
which $S_m = i$.



8.3.2. Application of the model 

The above Markov renewal model can be used to analyze the system 
performance in both testing phase and validation phase. In testing phase, the 
system is debugged, so the transition probabilities should change after each Type 
A failure. However, between two Type A failures, the transition probabilities are 
constant, so the distribution of time between two successive Type A failures can 
be easily derived by using the Laplace-Stieltjes transform. The conditional 
system reliability, which is defined as the survivor time distribution between two 
Type A failures, can also be obtained. 

On the other hand, the probability transition matrix will be constant during 
the validation phase after the test, because no changes are made to the system 
during that phase. Hence, the system reliability can be easily calculated. 



Some quantitative measures 

From a reliability point of view, the time between failures, or the number of
failures over time, is very important. Here we derive the distribution of the
discrete random variables $X_{i+1}^{j}$ $(j = 0, 2, 3, \ldots, n)$, defined as
the number of runs that visit the j-th state between the i-th Type A failure and
the (i+1)-st Type A failure.

The probability of every possible value of $X_{i+1}^{j}$
$(j = 0, 2, 3, \ldots, n)$ is given by

$$P\{X_{i+1}^{j} = K_j \mid j = 0, 2, 3, \ldots, n\} = \begin{cases} p_{11}, & \forall K_j = 0 \ (j = 0, 2, 3, \ldots, n) \\ g(K_0, K_2, K_3, \ldots, K_n), & \exists K_j \neq 0 \ (j = 0, 2, 3, \ldots, n) \end{cases} \qquad (8.25)$$

in which $g(K_0, K_2, K_3, \ldots, K_n)$ is a function of
$K_0, K_2, K_3, \ldots, K_n$, and $K_j$ denotes the number of runs spent in the
j-th state ($K_j = 0, 1, 2, \ldots$). The value of
$g(K_0, K_2, K_3, \ldots, K_n)$ can be obtained in principle.

Given that the j-th state is visited $K_j$ times $(j = 0, 2, 3, \ldots, n)$
between the i-th and (i+1)-st Type A failures, followed by a single Type A
failure, the distribution of the duration of this event can be derived as

$G(t) = F_{\cdot 0}^{K_0 *}(t) \otimes F_{\cdot 2}^{K_2 *}(t) \otimes F_{\cdot 3}^{K_3 *}(t) \otimes \cdots \otimes F_{\cdot n}^{K_n *}(t) \otimes F_{\cdot 1}(t) \qquad (8.26)$

in which $F_{\cdot j}^{K_j *}(t)$ is the $K_j$-fold convolution of
$F_{\cdot j}(t)$ $(j = 0, 2, 3, \ldots, n)$, $K_j$ can be $0, 1, 2, \ldots$, and
'$\otimes$' denotes the convolution of two functions.

Denote the distribution of the time between the i-th and (i+1)-st Type A
failures by $F_{i+1}(t)$, and let $T_{i+1}$ be the random variable for this
time. With the above two equations, it can be shown that

$$F_{i+1}(t) = P\{T_{i+1} \le t\} = \sum_{K_0=0}^{\infty} \sum_{K_2=0}^{\infty} \cdots \sum_{K_n=0}^{\infty} P\{X_{i+1}^{j} = K_j \mid j = 0, 2, \ldots, n\} \cdot G(t) \qquad (8.27)$$

The Laplace-Stieltjes transform of $F_{i+1}(t)$ can be obtained, and its
inversion is straightforward. A closed-form result can be obtained when
$F_{\cdot j}(t)$ $(j = 1, 2, \ldots, n)$ has a rational Laplace-Stieltjes
transform.

The reliability of the system after the i-th Type A failure is

$R_{i+1}(t) = 1 - F_{i+1}(t) \qquad (8.28)$

Some general properties of the inter-failure time can be developed without
making further assumptions. For example, the mean time between failures (between
the i-th and (i+1)-st Type A failures) is:

$E[T_{i+1}] = \int_0^{\infty} R_{i+1}(t)\,dt \qquad (8.29)$

or, see e.g. Goseva-Popstojanova & Trivedi (2000),

$E[T_{i+1}] = -\left. \frac{d\tilde{F}_{i+1}(s)}{ds} \right|_{s=0} \qquad (8.30)$

where $\tilde{F}_{i+1}(s)$ denotes the Laplace-Stieltjes transform of
$F_{i+1}(t)$.



Application to the validation phase 

After the testing (debugging) phase, the system enters a validation phase to show 
that it has a high reliability prior to actual use. In this phase, no changes are made 
to the system. Here, we use the two-type failure case as an illustration. Similar 
procedures can be implemented in solving general n-type failure problems. 

First we consider the independent case, that is,

$p_{0j} = p_{1j} = p_{2j} = p_{\cdot j}, \quad j = 0, 1, 2$

As long as the state after a run is not a Type A failure, the system is regarded
as reliable. The reliability in a single run is therefore $1 - p_{\cdot 1}$. The
reliability for m successive runs is defined as the probability that m
successive independent test runs are conducted without a Type A failure, which
can be derived as:

$R(m) = (1 - p_{\cdot 1})^m = (p_{\cdot 0} + p_{\cdot 2})^m \qquad (8.31)$

Given a confidence level $\alpha$, if $R(m) \ge \alpha$ we can say, with
confidence $\alpha$, that the system is reliable over m successive runs without
a Type A failure. To satisfy this condition, the value of $p_{\cdot 1}$ should
satisfy

$(1 - p_{\cdot 1})^m \ge \alpha \qquad (8.32)$

Given a confidence level $\alpha$, we can obtain an upper confidence bound on
$p_{\cdot 1}$, which is denoted by $p_{\cdot 1}^*$. Solving
$(1 - p_{\cdot 1}^*)^m = \alpha$, we obtain the upper bound

$p_{\cdot 1}^* = 1 - \alpha^{1/m} \qquad (8.33)$

This can be used to test whether the system can be certified: if
$p_{\cdot 1} \le p_{\cdot 1}^*$, the system is certified, with confidence
$\alpha$, as reliable over m successive runs without a Type A failure.
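Eqs. (8.31) and (8.33) are simple enough to evaluate directly. The following sketch does so in Python; the function names and the values $\alpha = 0.9$ and $m = 100$ are illustrative assumptions, not values from the text.

```python
def reliability_m_runs(p_type_a, m):
    """Eq. (8.31): probability of m independent runs without a Type A failure."""
    return (1.0 - p_type_a) ** m

def max_type_a_probability(alpha, m):
    """Eq. (8.33): largest Type A failure probability with R(m) >= alpha."""
    return 1.0 - alpha ** (1.0 / m)

# Illustrative values: 90% confidence over 100 successive runs.
bound = max_type_a_probability(alpha=0.9, m=100)
```

Any estimated Type A failure probability at or below `bound` satisfies condition (8.32) for these illustrative values.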

Now consider a sequence of possibly dependent system runs. During the validation
phase the system does not change, i.e., the transition probability matrix does
not change. The sequence of runs can therefore be described by a homogeneous
DTMC with this transition probability matrix. Assume that the DTMC is in steady
state, i.e., each run has the same failure probability:

$P\{Z_{i+1} = j\} = P\{Z_i = j\} = \sum_{m=0}^{2} p_{mj} P\{Z_i = m\}, \quad j = 0, 1, 2 \qquad (8.34)$

Let $P_j = P\{Z_i = j\}$ and substitute it into the above equation to get

$P_j = \sum_{m=0}^{2} p_{mj} P_m, \quad j = 0, 1, 2 \qquad (8.35)$

Solving these equations together with $P_0 + P_1 + P_2 = 1$ gives the
unconditional probabilities of the states on a run as

$$P_2 = \frac{p_{01} p_{12} + p_{02} - p_{11} p_{02}}{(1 - p_{11} + p_{01})(1 - p_{22} + p_{02}) - (p_{12} - p_{02})(p_{21} - p_{01})} \qquad (8.36)$$

$$P_1 = \frac{p_{02} p_{21} + p_{01} - p_{22} p_{01}}{(1 - p_{22} + p_{02})(1 - p_{11} + p_{01}) - (p_{21} - p_{01})(p_{12} - p_{02})} \qquad (8.37)$$

$$P_0 = 1 - P_1 - P_2 \qquad (8.38)$$

The reliability for m successive runs will be

$R(m) = (1 - P_1)^m = (P_0 + P_2)^m \qquad (8.39)$

An example is given here to illustrate the procedure.

Example 8.5. Suppose the distribution of the execution time of each run is
exponential so that

$F_{\cdot j}(t) = 1 - \exp(-\mu_j t), \quad j = 0, 1, 2$

Let $\mu_0 = 0.302$, $\mu_1 = 0.3$, $\mu_2 = 0.297$ as an illustration. In the
operational phase, the transition probability matrix can be estimated from
empirical data on successive runs. The following transition probability matrix
is used as an illustration:





>00 


^01 


8° 
1 




p= 


PiO 


Pn 


P,2 


= 




P !0 


Pi\ 


1 






Substituting these values into Eq. (8.27), we obtain the Laplace-Stieltjes
transform and then invert it to get the Cdf of the time between failures:

$$F(t) = 1 - 0.053e^{-0.3t} - 0.038e^{-0.302t} - 0.051e^{-0.297t} - 0.041e^{-0.55t} - 0.45e^{-0.05t} - 0.045e^{-0.51t} - 0.054e^{-0.342t} - 0.157e^{-0.259t} - 0.127\,t\,e^{-0.09t} - 0.111e^{-0.09t}$$

This equation implies that when successive runs are dependent, the Cdf of the
time between failures is a mixture of exponential distributions. Fig. 8.12
displays F(t).

Using the distribution function, the mean time to failure can be obtained as

$E[T] = \int_0^{\infty} [1 - F(t)]\,dt = 27.31$ (hours)
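The mean above can be reproduced from the coefficients of F(t), since each term $c\,e^{-rt}$ of $1 - F(t)$ integrates to $c/r$ and the term $c\,t\,e^{-rt}$ integrates to $c/r^2$. The sketch below carries out this check in Python; the coefficients are those of the Cdf of Example 8.5, while the variable and function names are illustrative.

```python
import math

# (coefficient, rate) pairs of the pure exponential terms in 1 - F(t).
EXP_TERMS = [
    (0.053, 0.3), (0.038, 0.302), (0.051, 0.297), (0.041, 0.55),
    (0.45, 0.05), (0.045, 0.51), (0.054, 0.342), (0.157, 0.259),
    (0.111, 0.09),
]
# (coefficient, rate) of the single term of the form c * t * exp(-r * t).
T_EXP_TERM = (0.127, 0.09)

def survival(t):
    """1 - F(t) for the Cdf of Example 8.5."""
    c, r = T_EXP_TERM
    return sum(ci * math.exp(-ri * t) for ci, ri in EXP_TERMS) + c * t * math.exp(-r * t)

def mean_time_to_failure():
    """Integral of 1 - F(t) over [0, infinity), term by term."""
    c, r = T_EXP_TERM
    return sum(ci / ri for ci, ri in EXP_TERMS) + c / r ** 2
```

The coefficients of the exponential terms sum to one, so that F(0) = 0, and the term-by-term integral evaluates to about 27.31 hours, in agreement with the value quoted in the example.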

The unconditional probabilities of the three different states can be calculated
through Eqs. (8.36)-(8.38):

$$P_1 = \frac{p_{02} p_{21} + p_{01} - p_{22} p_{01}}{(1 - p_{22} + p_{02})(1 - p_{11} + p_{01}) - (p_{21} - p_{01})(p_{12} - p_{02})} = 0.152$$

$P_0 = 1 - P_1 - P_2 = 0.326$

The steady-state probability that the system is reliable is

$P_0 + P_2 = 0.326 + 0.522 = 0.848$

Fig. 8.12. Cdf of the time between failures.
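The closed-form steady-state probabilities of Eqs. (8.36)-(8.38) can be cross-checked against a direct linear solve of the stationary equations. The sketch below does this for a hypothetical transition matrix; its entries are illustrative assumptions and are not the matrix of Example 8.5.

```python
import numpy as np

def stationary_closed_form(p):
    """Steady-state probabilities (P0, P1, P2) from Eqs. (8.36)-(8.38)."""
    d = ((1 - p[1, 1] + p[0, 1]) * (1 - p[2, 2] + p[0, 2])
         - (p[1, 2] - p[0, 2]) * (p[2, 1] - p[0, 1]))
    P2 = (p[0, 1] * p[1, 2] + p[0, 2] - p[1, 1] * p[0, 2]) / d
    P1 = (p[0, 2] * p[2, 1] + p[0, 1] - p[2, 2] * p[0, 1]) / d
    return np.array([1.0 - P1 - P2, P1, P2])

def stationary_linear_solve(p):
    """Solve pi = pi @ p with sum(pi) = 1 directly."""
    n = p.shape[0]
    A = p.T - np.eye(n)
    A[-1, :] = 1.0               # replace one equation by the normalization
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)

# Hypothetical transition matrix; row m holds p_m0, p_m1, p_m2.
P = np.array([
    [0.90, 0.02, 0.08],
    [0.85, 0.05, 0.10],
    [0.70, 0.10, 0.20],
])
pi = stationary_closed_form(P)
reliability_10_runs = (1.0 - pi[1]) ** 10   # Eq. (8.39) with m = 10
```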



8.4. Notes and References 

For multi-state systems, the book by Lisnianski & Levitin (2003) summarizes many
MSS reliability models and offers readers a complementary view to this chapter.
The authors have carried out extensive research on this topic. The book
describes many MSS reliability models for different structures, including
series, parallel, bridge and distributed networks, and for different settings,
including weighted voting systems, consecutively connected systems, sliding
window systems and so on.

Xue & Yang (1997) showed that multi-state reliability dynamic analysis 
could be transformed to a set of 2-state ones by using some generalized reliability 
parameters. Bukowski & Goble (2001) studied the MTTF of the MSS. 
Kolowrocki (2001) studied the MSS with components having exponential
reliability functions with different transition rates between subsets of their states,
which introduces the aging concept into the components of the MSS. Levitin &
Lisnianski (2001) considered vulnerable systems, which could have different 
states corresponding to different combinations of available elements composing 
the system. In real systems, a multilevel protection is often used, for example, in 
defense-in-depth design methodology (Fleming & Silady, 2002). The multilevel 
protection means that a subsystem and its inner level protection are in their turn 
protected by the protection of the outer level, which has been studied by Levitin 
(2003). Yeh (2003) presented an interesting model for the network reliability by 
assuming the nodes and links composing the network are of multiple states. 

Levitin & Lisnianski (2003) formulated the optimization problem of designing the
structure of a series-parallel multi-state system (including the choice of
system elements, their separation and protection) in order to achieve a desired
level of system survivability at minimal cost. Liu et al. (2003) also presented
a neural network to solve this optimization problem. Recently, Levitin et al.
(2003) further extended it to include multiple levels of protection and
presented a multi-processor genetic algorithm to solve it.

OPTIMAL SYSTEM DESIGN
AND RESOURCE ALLOCATION



In the design of computing systems, some important decision problems need to be
solved. These include determining the optimal number of distributed hosts, the
system structure and the network architecture. The objective could be to
maximize the reliability, to minimize the cost, or both.

Besides optimal system design, the problem of optimally allocating limited
resources (such as time, manpower, programs or files) in computing systems is
also of great concern. Given limited resources, different allocation strategies
lead to different system reliability and cost. To make the best of the
resources, their allocation must be considered carefully.

This chapter discusses some of these optimization problems. The optimal number
of redundant hosts for a distributed system design is presented first. Optimal
testing-resource allocation problems for either independent modules or dependent
versions of software are then discussed. Finally, the optimization of grid
architecture design and the grid service integration problem are studied.

9.1. Optimal Number of Hosts

An important goal in computing system design is to achieve high reliability or
availability through some kind of redundancy (such as redundant hosts) or fault
tolerance. Many systems are developed in an environment with redundant hosts.
The number of hosts has a significant influence on both cost and system
availability: additional hosts readily improve availability, but they can be
very costly. The objective here is to minimize the total cost based on the
following cost model.



9.1.1. The cost model 

To illustrate the relationships between the decisions and the costs, an
influence diagram, which provides a simple graphical representation of a
decision situation, is displayed in Fig. 9.1. Different decision elements are
shown in the influence diagram with different shapes; see e.g. Clemen (1995,
pp. 50-65).

The number of redundant hosts affects the optimal decision of the release 
time. Both the number of redundant hosts and release time affect the system 
availability. These factors determine the development cost. The number of 
hosts also determines the cost of redundant hosts. The release time determines 
the rewards or penalty depending on whether the release is before or after the 
deadline. If the system is unavailable after release, a risk cost is incurred. 
Hence, the cost of redundant hosts, the development cost, reward and penalty 
should be considered together when deriving the total expected cost. Each cost 
component will be described in the following. 

Cost of redundant hosts 

The cost function for a multi-version fault-tolerant system can be described as
a linear function of the number of versions:

$C_h(N) = a_1 N + b_1 \qquad (9.1)$

where N is the number of hosts, $b_1$ is a constant, and $a_1$ is the expected
cost per host. Here we have assumed that the redundant hosts used in the system
are of the same type.




Fig. 9.1. Influence diagram for the cost affected by redundant hosts. 



Reward for early release 

Usually there is a deadline for release, in particular when the penalty cost for
delay is very high. On the other hand, there is a reward for releasing the
system earlier. We assume that $b_2$ is a constant reward granted if the system
is released on time, no matter how early the release is, and that $a_2$ is the
expected reward per unit time before the deadline. The reward as a function of
the release time can be expressed as

$B(t_r) = a_2 (T_d - t_r) + b_2, \quad t_r \le T_d \qquad (9.2)$

where $T_d$ is the deadline for release and $t_r$ is the release time, so that
$T_d - t_r$ is the time ahead of schedule.



Risk cost for system being unavailable 

This cost arises when the system is unavailable after release, and is termed the
risk cost as in Pham & Zhang (1999). Here we assume that the risk cost of an
unavailable system is a function of the system availability and the release
time:

$C_r(N, t_r) = a_3 \int_{t_r}^{T_e} [1 - A_N(t)]\,dt \qquad (9.3)$

where $t_r$ is the release time, $T_e$ is the end time of the contracted
maintenance after release, $A_N(t)$ is the availability function at time t for
the N-host system, and $a_3$ is the risk cost per unit time when the system is
not available. In the equation above, $1 - A_N(t)$ is the probability that the
system is unavailable at time t.



Development cost 

The development cost function for a single software module proposed in
Kumar & Malik (1991) is

$C_i(R_i) = H_i \exp(B_i R_i - D_i) \qquad (9.4)$

where $H_i$, $B_i$ and $D_i$ are constants and $R_i$ is the individual module
software reliability achieved at the end of testing.

Then, the total expected cost can be expressed as

$C(N, t_r) = C_h(N) + C_r(N, t_r) + C_i(R(t_r)) - B(t_r) \qquad (9.5)$

9.1.2. System availability

The system availability model for a homogeneous distributed software/hardware
system can be obtained directly from Chapter 6.3. A numerical example is shown
below.



Example 9.1. Suppose K 0 = 32 and >1=0.006, \ =0.01, =0.1 and 

^.=0.13, the system availability for different number of hosts can be obtained 
from the analysis presented in Chapter 6.3. The results are depicted in Fig. 9.2. 




Fig. 9.2. System availability for different numbers of redundant hosts.

We can observe that the system availability increases with the number of
redundant hosts. The system availability function is used in the optimization
model described next.



9.1.3. Optimization model and solution procedure 

The optimization model is based on a cost criterion, and the decision variables
are the number of redundant hosts and the release time. Its objective is to
minimize the expected total cost. There are three types of constraints in this
decision problem. First, the customers may require a minimum system availability
$A^*$ after the release. Second, there is a deadline for the system to be
released, so the release time must not be later than that. Finally, the
customers may limit the maximum number of redundant hosts $N^*$ because of their
budget and other physical restrictions.

That is, the decision variables are $N$ and $t_r$, and the optimization model is

Minimize:    $C(N, t_r)$    (9.6)

Subject to:  $A_N(t_r) \ge A^* \ge 0$    (9.7)

             $T_d \ge t_r \ge 0$    (9.8)

             $N = 1, 2, 3, \ldots, N^*$    (9.9)

where $A^*$ is the required system availability after the release, $T_d$ is the
deadline for release and $N^*$ is the maximum number of redundant hosts allowed.
If there is no such constraint, a sufficiently large value of $N^*$ can be
assumed in this model. Usually, however, only a small number of redundant hosts
is practical.

To obtain an optimal solution, the solution procedure is as follows:

Step 1: Obtain the system availability function of the distributed system
with N redundant hosts.

Step 2: Derive each cost function and obtain the expected total cost.

Step 3: Let N take each integer value from 1 to $N^*$ to obtain the expected
total cost, and save the results as $C(1, t_r)$ to $C(N^*, t_r)$.

Step 4: For each of $C(1, t_r)$ to $C(N^*, t_r)$, compute the optimal release
time and save the results as OpTr(1) to OpTr($N^*$); the corresponding minimum
expected total costs are saved as MinC(1) to MinC($N^*$).

Step 5: Compare the minimum total expected costs MinC(1) to MinC($N^*$) to
select the optimum number of redundant hosts OpN, i.e., the n that minimizes
MinC(n) (n = 1, 2, ..., $N^*$).

The above procedure can easily be implemented in a computer program. A numerical
example is presented to illustrate the optimization procedure.
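The five steps above can be sketched as a simple grid search. In the Python sketch below, the cost parameters are those of Example 9.2, but the functions `availability` and `software_reliability` are hypothetical stand-ins chosen only so that the code runs; they are not the Chapter 6.3 availability model or a fitted reliability growth model, so the numbers produced do not reproduce Table 9.1.

```python
import math

# Cost parameters of Example 9.2.
A1, B1 = 17600.0, 1293.0      # cost per host, fixed fee for the hosts
A2, B2 = 31.5, 2123.7         # reward per hour ahead of deadline, fixed reward
A3 = 8000.0                   # risk cost per hour of unavailability
T_D, T_E = 800.0, 1100.0      # release deadline, end of contracted maintenance
H, B, D = 10232.0, 16.0, 14.0 # development cost constants
A_STAR, N_STAR = 0.88, 5      # required availability, maximum number of hosts

def availability(n_hosts, t):
    """ASSUMED availability A_N(t); a stand-in for the Chapter 6.3 model."""
    single = 1.0 - 0.4 * math.exp(-0.004 * t)
    return 1.0 - (1.0 - single) ** n_hosts

def software_reliability(t):
    """ASSUMED software reliability R(t_r) reached after testing for t hours."""
    return 1.0 - 0.3 * math.exp(-0.005 * t)

def risk_cost(n_hosts, t_r, steps=300):
    """Eq. (9.3), integrated with the trapezoidal rule."""
    h = (T_E - t_r) / steps
    ys = [1.0 - availability(n_hosts, t_r + k * h) for k in range(steps + 1)]
    return A3 * h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

def total_cost(n_hosts, t_r):
    """Eq. (9.5): host cost + risk cost + development cost - reward."""
    host = A1 * n_hosts + B1
    dev = H * math.exp(B * software_reliability(t_r) - D)
    reward = A2 * (T_D - t_r) + B2
    return host + risk_cost(n_hosts, t_r) + dev - reward

def optimize(step=10.0):
    """Steps 3-5: best feasible release time for each N, then the best N."""
    min_c, op_tr = {}, {}
    for n in range(1, N_STAR + 1):
        t_r = 0.0
        while t_r <= T_D:
            if availability(n, t_r) >= A_STAR:       # constraint (9.7)
                c = total_cost(n, t_r)
                if n not in min_c or c < min_c[n]:
                    min_c[n], op_tr[n] = c, t_r
            t_r += step
    op_n = min(min_c, key=min_c.get)
    return op_n, op_tr, min_c

op_n, op_tr, min_c = optimize()
```

A grid search is used here for transparency; a one-dimensional numerical minimizer over $t_r$ could replace the inner loop when a finer release time is needed.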



Example 9.2. Company X is awarded a contract to develop a telephone switching
system. In this case, the hardware hosts are bought from external suppliers, but
the software is developed in house and tested with the system. The main
questions are how many redundant hosts are needed and when the system should be
released so that the total cost is minimized. For illustrative purposes, the
following input values are used:

1) The system availability needs to be higher than 0.88 when it is released.

2) The deadline for releasing the system is 800 hours from now.

3) The penalty cost for an unavailable system is about \$8000 per hour during
the first 300 hours after release.

4) Each host costs \$17600 and a fixed fee for all the hosts is \$1293.

5) The maximum number of redundant hosts is five.

6) If the company can release the system earlier than the deadline, there is
a constant reward of \$2123.7 and a variable reward of \$31.5 per hour.

Based on the conditions and the assumptions given above, the values of the
parameters can be obtained as

$a_1 = 17600$, $b_1 = 1293$, $a_2 = 31.5$,

$b_2 = 2123.7$, $a_3 = 8000$, $T_d = 800$ hours,

and $T_e = 800 + 300 = 1100$ hours.

The parameters for the software development cost are assumed to be $H = 10232$,
$B = 16$, $D = 14$. The optimization problem can be solved with the required
system availability at release, $A^*$, equal to 0.88 and the maximum number of
redundant hosts, $N^*$, equal to 5.

Here we assume that the system is a homogeneous distributed software/hardware
system whose availability function is depicted in Fig. 9.2. With the parameter
values given above, we can obtain the total mean cost through Eq. (9.5) as

$$C(N, t_r) = 17600N + 31.5\,t_r + 8000 \int_{t_r}^{1100} [1 - A_N(t)]\,dt + 10233 \exp\{16 R(t_r) - 14\} - 26030.7$$

Finally, the total expected cost as a function of release time for different
numbers of redundant hosts is depicted in Fig. 9.3, and the overall results are
given in Table 9.1.



Table 9.1. Numerical values of the minimum cost for different N.

N          1        2        3        4        5
MinC(N)    326970   153060   116690   104580   110880
OpTr(N)    800      800      324.2    261.7    232

From Table 9.1, the global minimum cost is 104580 (units), attained with N = 4
redundant hosts and the optimum release time $t_r = 261.7$ (hrs). The optimum
solution is therefore to use four redundant hosts and to test the system for
261.7 hours.




Fig. 9.3. Total expected cost vs. release time of different number of hosts. 



9.2. Resource Allocation - Independent Modules 

Testing-resource refers to the resources spent on software testing, e.g.,
manpower and available time. During the testing stage, a project manager often
faces decision problems such as how to allocate the available time (the time
before the deadline) among the modules and how to assign personnel. To combine
these two types of resources (manpower and available time), we define the total
testing time as the number of personnel multiplied by the available time. Each
unit of total testing time represents one person working for one unit of time.
Here the testing-resource is the total testing time, and the two terms are used
interchangeably.

For the optimal testing-resource allocation problem, the following
assumptions are made here:

(a) The n modules of the software are independent during the unit-testing phase.

(b) After $T_i$ units of testing time, the failure rate of module i is
$\lambda_i(T_i)$.

The reliability of module i is

$R_i(x \mid T_i) = \exp\{-\lambda_i(T_i)\,x\}, \quad x \ge 0 \qquad (9.10)$

where x is the operational time after testing. Note that the above uses the
operational reliability definition (Yang & Xie, 2000), since it is more common
that there is no reliability growth after release, so that the failure rate
remains constant and equal to $\lambda_i(T_i)$.



9.2.1. Allocation on serial modular software 

If the software system fails whenever any of its modules fails, it is called a
serial software system. This is exactly the case for many modular software
systems. The structure of such a system is illustrated in Fig. 9.4.

Fig. 9.4. Structure of a serial software system.

Denote by $T_i$ the testing time allocated to module i. After $T_i$ units of
testing time, the failure rate of module i is $\lambda_i(T_i)$. The reliability
of module i is

$$ R_i(x \mid T_i) = \exp\{-\lambda_i(T_i)\,x\}, \quad x \ge 0, \quad i = 1, 2, \ldots, n $$

The reliability of the whole software system is given by

$$ R(x \mid T_1, \ldots, T_n) = \prod_{i=1}^{n} R_i(x \mid T_i) = \exp\Big\{-x \sum_{i=1}^{n} \lambda_i(T_i)\Big\} \tag{9.11} $$

The optimal testing-resource allocation problem is formulated as 



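$$ \text{Maximize} \quad R(x \mid T_1, \ldots, T_n) = \exp\Big\{-x \sum_{i=1}^{n} \lambda_i(T_i)\Big\} \tag{9.12} $$

$$ \text{Subject to} \quad \sum_{i=1}^{n} T_i \le T \tag{9.13} $$

$$ T_i \ge 0, \quad i = 1, 2, \ldots, n \tag{9.14} $$
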

The formulation above is equivalent to minimizing the sum of failure 
occurrence rates, i.e., 

$$ \text{Minimize} \quad \sum_{i=1}^{n} \lambda_i(T_i) \tag{9.15} $$

It can be noted that the general formulation presented above does not 
require a particular model for the mean value function and thus it has much 
flexibility. In fact, we could even use different software reliability models for 
different modules. 

In order to obtain a general solution to this problem, the Lagrangian is 
constructed as 

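$$ L = \sum_{i=1}^{n} \lambda_i(T_i) + \lambda \Big( \sum_{i=1}^{n} T_i - T \Big) \tag{9.16} $$

The necessary and sufficient conditions for the minimum are (Bazaraa et al.,
1979, p. 149)

$$ \frac{\partial L}{\partial T_i} = \frac{\partial \lambda_i(T_i)}{\partial T_i} + \lambda \ge 0, \quad i = 1, 2, \ldots, n \tag{9.17} $$

$$ T_i \, \frac{\partial L}{\partial T_i} = 0, \quad i = 1, 2, \ldots, n \tag{9.18} $$

$$ \lambda \Big( \sum_{i=1}^{n} T_i - T \Big) = 0 \tag{9.19} $$

$$ \lambda \ge 0 \tag{9.20} $$

The optimal solution $T_1^*, T_2^*, \ldots, T_n^*$ can be obtained by solving the
above equations numerically. Define $g_i(t) = -d\lambda_i(t)/dt$; then an
equivalent form of Eq. (9.17) is

$$ \lambda \ge g_i(T_i) \tag{9.21} $$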



For most software reliability models, $g_i(t)$ is a positive and
non-increasing function. It is shown in Yang & Xie (2001) that if $g_i(t) > 0$
and $g_i(t)$ is non-increasing on $t \in [0, T]$, we can let

$$ \lambda_i = g_i(0), \quad i = 1, 2, \ldots, n \tag{9.22} $$

Then, if we reorder the software modules 1, 2, ..., n such that
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, the optimal solution to the
testing-resource allocation problem is:

$$ T_i^* = \begin{cases} g_i^{-1}(\lambda), & i = 1, 2, \ldots, k \\ 0, & i = k+1, \ldots, n \end{cases} \tag{9.23} $$

where $\lambda$ satisfies $\sum_{i=1}^{k} g_i^{-1}(\lambda) = T$, and k satisfies
$\lambda_k > \lambda \ge \lambda_{k+1}$.

From the results above, the optimum solution can be obtained by the
following iterative algorithm.

Step 1. Compute $\lambda_i$ using Eq. (9.22).

Step 2. Set $l = 1$.

Step 3. Obtain $\lambda^{(l)}$ by solving the following equation:

$$ \sum_{i=1}^{l} g_i^{-1}(\lambda^{(l)}) = T \tag{9.24} $$

Step 4. If $\lambda_l > \lambda^{(l)} \ge \lambda_{l+1}$, then $\lambda = \lambda^{(l)}$ and the
optimal solution can be obtained by Eq. (9.23); then stop. Otherwise set
$l = l + 1$ and go back to Step 3.



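Example 9.3. Assume that the software system is composed of three modules
for which the testing processes follow the logarithmic Poisson execution time
model (Musa & Okumoto, 1984). That is, the mean value function of module i is

$$ m_i(t) = \frac{\ln(\alpha_i \varphi_i t + 1)}{\varphi_i}, \quad i = 1, 2, 3 $$

so that the failure rate is $\lambda_i(t) = \alpha_i / (\alpha_i \varphi_i t + 1)$.
It can be shown that

$$ g_i(t) = \frac{\alpha_i^2 \varphi_i}{(\alpha_i \varphi_i t + 1)^2}, \quad i = 1, 2, 3 $$

are positive and strictly decreasing on $t \in [0, \infty)$. The optimization
algorithm described in the previous section can therefore be used. In this
case, $\lambda_i = g_i(0) = \alpha_i^2 \varphi_i$. The solution to Eq. (9.24) is

$$ \lambda^{(l)} = \left[ \frac{\sum_{i=1}^{l} \varphi_i^{-1/2}}{T + \sum_{i=1}^{l} (\alpha_i \varphi_i)^{-1}} \right]^2 $$

and Eq. (9.23) becomes

$$ T_i^* = \begin{cases} \dfrac{1}{\sqrt{\varphi_i \lambda}} - \dfrac{1}{\alpha_i \varphi_i}, & i = 1, \ldots, k \\ 0, & i = k+1, \ldots, 3 \end{cases} $$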



Suppose that the parameters of the three modules have been estimated by 
historical testing data and are summarized in Table 9.2, and an additional 5000 
CPU hours of testing-time is available to be allocated among these three 
modules. By solving the optimization problem as described in previous section, 
the optimal allocation is obtained and shown in Table 9.2. 



Table 9.2. Estimated parameters and optimal allocation for Example 9.3. 



Module    $\alpha_i$    $\varphi_i$    $T_i^*$
1         16.0          0.09           1534
2         6.0           0.03           2653
3         1.5           0.32           813



The reliability of the software system after the additional 5000 hours of
testing is:

$$ R(x \mid T_1 = 1534, T_2 = 2653, T_3 = 813) = \exp(-0.02361\,x), \quad x \ge 0 $$
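
The iteration of Steps 1 to 4 can be sketched in a few lines of Python for the logarithmic Poisson model of Example 9.3. This is only an illustrative sketch: it assumes the failure rate $\lambda_i(t) = \alpha_i / (\alpha_i \varphi_i t + 1)$, for which $g_i(0) = \alpha_i^2 \varphi_i$ and Eq. (9.24) has a closed-form root, and it assumes the convention $\lambda_{n+1} = 0$ and a strictly positive budget. The function names are hypothetical. With the parameters of Table 9.2 the allocation should land close to the tabulated values.

```python
import math


def allocate_serial(alpha, phi, total_time):
    """Sketch of Steps 1-4 for a serial system under the logarithmic Poisson
    model, lambda_i(t) = alpha_i / (alpha_i * phi_i * t + 1)."""
    n = len(alpha)
    # Step 1: lambda_i = g_i(0) = alpha_i^2 * phi_i, modules sorted descending.
    order = sorted(range(n), key=lambda i: alpha[i] ** 2 * phi[i], reverse=True)
    # Assumed convention: lambda_{n+1} = 0 closes the last interval.
    thresholds = [alpha[i] ** 2 * phi[i] for i in order] + [0.0]

    for l in range(1, n + 1):  # Steps 2-4
        active = order[:l]
        # Step 3: closed-form root of sum_i g_i^{-1}(lam) = T for this model.
        s1 = sum(1.0 / math.sqrt(phi[i]) for i in active)
        s2 = sum(1.0 / (alpha[i] * phi[i]) for i in active)
        lam = (s1 / (total_time + s2)) ** 2
        # Step 4: stop when lam lies between consecutive thresholds.
        if thresholds[l - 1] > lam >= thresholds[l]:
            times = [0.0] * n
            for i in active:
                times[i] = 1.0 / math.sqrt(phi[i] * lam) - 1.0 / (alpha[i] * phi[i])
            return times, lam
    raise ValueError("no multiplier found; total_time must be positive")


def system_failure_rate(alpha, phi, times):
    """Sum of module failure rates after testing, i.e. the objective (9.15)."""
    return sum(a / (a * p * t + 1.0) for a, p, t in zip(alpha, phi, times))


alpha = [16.0, 6.0, 1.5]   # Table 9.2
phi = [0.09, 0.03, 0.32]   # Table 9.2
times, lam = allocate_serial(alpha, phi, 5000.0)
rate = system_failure_rate(alpha, phi, times)
print([round(t, 1) for t in times], round(rate, 5))
```

Sorting by $g_i(0)$ first means that modules whose initial gain from testing is too small are simply never activated, which is how Eq. (9.23) assigns them zero testing time.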



9.2.2. Allocation on parallel modular software 

The system is assumed to be a parallel redundant system (Fig. 9.5). Such a
software system fails only when all modules fail. The achieved reliability of
the system after the unit-testing phase is

$$ R(x \mid T_1, \ldots, T_n) = 1 - \prod_{i=1}^{n} \big[ 1 - R_i(x \mid T_i) \big], \quad x \ge 0 \tag{9.25} $$

where $R_i(x \mid T_i)$ is the operational reliability of module i.




Fig. 9.5. Structure of a parallel redundant software system. 



The optimal testing-resource allocation problem is formulated as 

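$$ \text{Maximize} \quad R(x \mid T_1, \ldots, T_n) = 1 - \prod_{i=1}^{n} \big[ 1 - \exp\{-\lambda_i(T_i)\,x\} \big] \tag{9.26} $$

$$ \text{Subject to} \quad \sum_{i=1}^{n} T_i \le T \tag{9.27} $$

$$ T_i \ge 0, \quad i = 1, 2, \ldots, n \tag{9.28} $$

An equivalent form of Eq. (9.26) is:

$$ \text{Minimize} \quad \sum_{i=1}^{n} \ln \big[ 1 - \exp\{-\lambda_i(T_i)\,x\} \big] \tag{9.29} $$

Now the optimal testing-resource allocation problem is formulated by the above
equations. The Lagrangian is constructed as:

$$ L = \sum_{i=1}^{n} \ln \big[ 1 - \exp\{-\lambda_i(T_i)\,x\} \big] + \lambda \Big( \sum_{i=1}^{n} T_i - T \Big) $$

The necessary and sufficient conditions for the minimum are

$$ \frac{\partial L}{\partial T_i} = \frac{x}{\exp\{\lambda_i(T_i)\,x\} - 1} \, \frac{\partial \lambda_i(T_i)}{\partial T_i} + \lambda \ge 0, \quad i = 1, 2, \ldots, n \tag{9.30} $$

$$ T_i \, \frac{\partial L}{\partial T_i} = 0, \quad i = 1, 2, \ldots, n \tag{9.31} $$

$$ \lambda \Big( \sum_{i=1}^{n} T_i - T \Big) = 0 \tag{9.32} $$

$$ \lambda \ge 0 \tag{9.33} $$

The optimal solution $T_1^*, T_2^*, \ldots, T_n^*$ can be obtained by solving the
above equations numerically.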
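
As a rough numerical illustration, the parallel problem (9.26)-(9.28) can be explored for two modules with a plain grid search. This is a stand-in for solving conditions (9.30)-(9.33) directly, not the method of the text. The logarithmic Poisson failure rate is borrowed from Example 9.3 as an assumption, and the parameter values, budget and function names are hypothetical. The search spends the whole budget, which loses nothing here because each failure rate decreases with testing time.

```python
import math


def parallel_reliability(rates, x):
    """Eq. (9.25) with module reliabilities R_i = exp(-rate_i * x)."""
    all_fail = 1.0
    for r in rates:
        all_fail *= 1.0 - math.exp(-r * x)
    return 1.0 - all_fail


def mo_rate(alpha, phi, t):
    """Assumed logarithmic Poisson failure rate after t units of testing."""
    return alpha / (alpha * phi * t + 1.0)


def grid_search_two_modules(params, total_time, x, steps=1000):
    """Try every split (t1, total_time - t1) on a uniform grid and keep the
    one with the highest system reliability."""
    best = None
    for s in range(steps + 1):
        t1 = total_time * s / steps
        t2 = total_time - t1
        rates = [mo_rate(*params[0], t1), mo_rate(*params[1], t2)]
        r = parallel_reliability(rates, x)
        if best is None or r > best[0]:
            best = (r, t1, t2)
    return best


params = [(16.0, 0.09), (6.0, 0.03)]  # hypothetical (alpha, phi) pairs
best = grid_search_two_modules(params, total_time=1000.0, x=10.0)
print(best)
```

For more than two modules the grid grows quickly, so a numerical solver for (9.30)-(9.33) or a heuristic search would be the more practical choice.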



9.2.3. Allocation on mixed parallel-series modules 

Fig. 9.6 shows the structure of a mixed parallel-series modular software
system. There are n groups of parallel modules and m serial modules.



Fig. 9.6. The structure of a parallel-series modular software system.

Single objective of maximizing reliability

The reliability of this parallel-series modular software system is calculated
by the following equation

$$ R(x \mid T) = \prod_{l=1}^{n} \Big\{ 1 - \prod_{i=1}^{k_l} \big[ 1 - R_{li}(x \mid T_{li}) \big] \Big\} \prod_{j=1}^{m} R_j(x \mid T_j) \tag{9.34} $$

where $k_l$ is the number of modules in the l:th parallel group, $T_{li}$ is the
testing time allocated to the i:th module of the l:th parallel group, and
$T_j$ is the testing time allocated to the j:th serial module. Then, the
following optimization model maximizes the system reliability:

$$ \text{Maximize} \quad R(x \mid T_{li}, T_j) = \prod_{l=1}^{n} \Big\{ 1 - \prod_{i=1}^{k_l} \big[ 1 - R_{li}(x \mid T_{li}) \big] \Big\} \prod_{j=1}^{m} R_j(x \mid T_j) \tag{9.35} $$

$$ \text{Subject to} \quad \sum_{l=1}^{n} \sum_{i=1}^{k_l} T_{li} + \sum_{j=1}^{m} T_j = T \tag{9.36} $$

$$ T_{li}, T_j \ge 0 $$

in which T is the total testing time consumed by all the modules of the
parallel groups ($T_{li}$) and the serial modules ($T_j$).
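
Eq. (9.34) translates directly into code. The sketch below evaluates the reliability of a parallel-series structure from given module reliabilities; the helper `go_module_reliability` additionally assumes a GO-model failure rate $a b \exp(-bT)$ held constant after release, in line with the operational reliability definition of Eq. (9.10). The function names and the demo values are illustrative assumptions.

```python
import math


def parallel_series_reliability(group_rel, series_rel):
    """Eq. (9.34): product over parallel groups of (1 - prod(1 - R_li)),
    times the product of the serial module reliabilities R_j."""
    result = 1.0
    for group in group_rel:
        all_fail = 1.0
        for r in group:
            all_fail *= 1.0 - r      # every module in the group fails
        result *= 1.0 - all_fail     # the group survives otherwise
    for r in series_rel:
        result *= r
    return result


def go_module_reliability(a, b, test_time, x):
    """Assumed module reliability: GO-model failure rate a*b*exp(-b*T),
    constant during an operational period of length x."""
    rate = a * b * math.exp(-b * test_time)
    return math.exp(-rate * x)


# One parallel group with two modules, followed by one serial module.
demo = parallel_series_reliability([[0.9, 0.8]], [0.95])
print(demo)
```

An objective function of this form is what a heuristic search would evaluate for each candidate allocation.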



Multiple objectives of maximizing reliability and minimizing cost 

Assume that the cost function of module i is $c_i(R_i)$, in which $R_i$ is the
reliability of the i:th module. The total cost of the parallel-series modular
software system of Fig. 9.6 is

$$ C(R_{li}, R_j) = \sum_{l=1}^{n} \sum_{i=1}^{k_l} c_{li}(R_{li}) + \sum_{j=1}^{m} c_j(R_j) \tag{9.37} $$

where $\sum_{i=1}^{k_l} c_{li}(R_{li})$ is the total cost of the l:th group of
parallel modules, $\sum_{l=1}^{n} \sum_{i=1}^{k_l} c_{li}(R_{li})$ is the total cost
of all the n groups of parallel modules, and $\sum_{j=1}^{m} c_j(R_j)$ is the
total cost of all the series modules.

Here, we adopt the cost function for individual module i shown by Eq. (9.4),
proposed in Kumar & Malik (1991).

The optimal testing-resource allocation problem can then be formulated
with two objectives as

$$ \text{1) Maximize} \quad R(x \mid T_{li}, T_j) = \prod_{l=1}^{n} \Big\{ 1 - \prod_{i=1}^{k_l} \big[ 1 - R_{li}(x \mid T_{li}) \big] \Big\} \prod_{j=1}^{m} R_j(x \mid T_j) \tag{9.38} $$

$$ \text{2) Minimize} \quad C(R_{li}, R_j) = \sum_{l=1}^{n} \sum_{i=1}^{k_l} c_{li}(R_{li}) + \sum_{j=1}^{m} c_j(R_j) \tag{9.39} $$

$$ \text{Subject to} \quad \sum_{l=1}^{n} \sum_{i=1}^{k_l} T_{li} + \sum_{j=1}^{m} T_j = T \tag{9.40} $$

$$ T_{li}, T_j \ge 0 $$

in which T is the total testing time consumed by all the modules of the
parallel groups ($T_{li}$) and the serial modules ($T_j$).

For mixed parallel-series modular software, these problems are difficult to
solve analytically, so heuristic algorithms such as genetic algorithms,
simulated annealing or tabu search can be applied. Dai et al. (2003b) presented
a genetic algorithm to solve the above multi-objective allocation problems.
An example of this type is illustrated here with that genetic algorithm.



Example 9.4. The structure of this eight-module example is shown in Fig. 9.7.

Fig. 9.7. The structure of a complex parallel-series modular system.



We use the GO-model for illustration. The mean value function is:

$$ m_i(t) = a_i [1 - \exp(-b_i t)], \quad i = 1, 2, \ldots, 8 \tag{9.41} $$

We assume here that the total testing time is 23000 hours and that x = 200
hours are needed to complete the given task. The parameter values and the
optimal solution found by the genetic algorithm are given in Table 9.3, where
$T_i^*$ (i = 1, 2, ..., 8) is the optimal testing time allocated to module i.



Table 9.3. The parameters of parallel-series modular software system. 



Modules   $a_i$    $b_i$      (header illegible)   $B_i$    $D_i$    $T_i^*$
1         210      0.00051    3.493                6.011    4.97     93.47
2         199      0.00059    3.503                6.12     4.93     10522
3         453      0.00048    3.498                6.012    4.995    0
4         345      0.00058    3.498                6.001    4.997    54.11
5         258      0.00063    3.499                6.002    4.995    60.48
6         221      0.00074    3.5015               6.15     4.97     8822.8
7         33.99    0.00579    3.495                6.01     4.98     2190.83
8         32.32    0.00593    3.500                6.005    4.01     1256.31



9.3. Resource Allocation - Dependent Modules

A method to increase the reliability of safety-critical software is the
N-version programming technique, e.g. Avizienis (1985). In the analysis of this
type of system, a common assumption is the independence of the different
versions. In the following, we first present a model for dependent N versions
of software. Then, based on the model, the optimum allocation of the testing
resource/time among the dependent N versions is discussed.



9.3.1. Reliability analysis for dependent N-version programming

N-version programming involves the execution of multiple versions of
software. A voting scheme matches and tests the outputs, and then determines a
final result. There are various voting schemes. Here we use the voting scheme
of “selecting the first qualified result”, which is explained in detail in
Belli & Jedrzejowicz (1991). In this voting scheme, if any one version among
the N versions of software passes a test, the voter selects it as the final
result, whether or not the other versions are qualified.



Decomposition by multi-component modeling 

In N-version software, any j versions may fail at the same time because of
certain common cause failures. For example, if j versions of the N-version
software share a common subroutine, these j versions may fail simultaneously.
We define a parameter for such a failure, called the dependence level, as the
number of versions that fail simultaneously because of the failure.

We denote by $M_{j,k}$ the “components” that correspond to different
common cause failures, where j (j = 1, 2, ..., N) is the dependence level that
correlates any j out of the N versions and k (k = 1, 2, ..., $K_{N,j}$) indexes
the k:th component among all the components of the j:th dependence level,
where

$$ K_{N,j} = \binom{N}{j} = \frac{N!}{j!\,(N-j)!} $$

If all the failures with the j:th dependence level are numbered by k
(k = 1, 2, ..., $K_{N,j}$), the components $M_{j,k}$ represent all the failures
with the different dependence levels. The total number of “components”
$M_{j,k}$ (j = 1, 2, ..., N; k = 1, 2, ..., $K_{N,j}$) is equal to $2^N - 1$.

The N dependent versions of software can be decomposed into these $2^N - 1$
mutually exclusive components. Note that the N versions may not be
physically separated. An example of three-version programming is illustrated
below.
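
The decomposition can be made concrete by enumerating every non-empty subset of the N versions, since each component $M_{j,k}$ corresponds to one set of j versions that fail together. The sketch below is illustrative only: the function name is hypothetical, and the index k within a level simply follows the order produced by `itertools.combinations`, which need not match the numbering used in the examples of the text.

```python
from itertools import combinations
from math import comb


def decompose(n_versions):
    """Map (j, k) to the tuple of versions correlated by component M_{j,k}.
    j is the dependence level; k numbers the subsets of size j."""
    components = {}
    versions = range(1, n_versions + 1)
    for j in range(1, n_versions + 1):
        for k, subset in enumerate(combinations(versions, j), start=1):
            components[(j, k)] = subset
    return components


comps = decompose(3)
for key, subset in comps.items():
    print(key, subset)
print(len(comps), [comb(3, j) for j in (1, 2, 3)])
```

The count doubles with each added version, which is why the reliability block diagram becomes hard to handle for four or more versions.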



Example 9.5. Consider a fault-tolerant system with three versions of software 
which might be dependent. The three dependent versions are correlated as 
shown in Fig. 9.8 and the states can be decomposed into 7 mutually exclusive 
parts, called components here. 




Fig. 9.8. Three dependent versions of software.

Let $M_{1,k}$ (k = 1, 2, 3) denote the failures that affect only the k:th version
without influence on the other two versions; $M_{2,k}$ (k = 1, 2) denote the
common cause failures that correlate the k:th and (k+1):st versions without
influence on the other version; $M_{2,3}$ represent the failures that correlate
the first and the third versions; and $M_{3,1}$ denote the failures that
correlate all three versions.

The reliability block diagram for those components can be built as shown in 
Fig. 9.9. 




Fig. 9.9. Reliability block diagram of the decomposed components. 



The reliability block diagram is complex, containing not only many
parallel-series units but also some bridge structures. Moreover, the diagram
becomes much more complicated for four or more versions. Hence, reliability
estimation for dependent N-version programming is not straightforward. In
order to analyze the system reliability based on the above model, a general
approach is presented below.



System reliability function 

The reliability of a component $M_{j,k}$ is defined as the probability that the
corresponding common cause failure does not occur, which is denoted by
$R_{j,k}(t)$.

The software reliability of dependent N-version programming is
defined as the probability that at least one version of software can achieve
the task successfully. The software reliability function at time t can be
expressed as

$$ R(t) = \Pr(\text{at least one version of software is reliable at time } t) \tag{9.42} $$

Let $E_i(t)$ represent the event in which the i:th version of software is
reliable, i.e. successfully achieves the given task at time t (i = 1, 2, ..., N).
The software reliability function for dependent N-version programming can then
be written as

$$ R(t) = \Pr\Big\{ \bigcup_{i=1}^{N} E_i(t) \Big\} \tag{9.43} $$

By using conditional probability, the events considered in the above equation
can be decomposed into mutually exclusive events as

$$ R(t) = \Pr\{E_1(t)\} + \Pr\{E_2(t)\}\,\Pr\{\bar{E}_1(t) \mid E_2(t)\} + \cdots + \Pr\{E_N(t)\}\,\Pr\{\bar{E}_1(t), \bar{E}_2(t), \ldots, \bar{E}_{N-1}(t) \mid E_N(t)\} \tag{9.44} $$

where $\Pr\{\bar{E}_1(t) \mid E_2(t)\}$ denotes the conditional probability that the
first version of the software fails given that the second version of the
software is reliable at time t.

Hence, each term in the software reliability expression of the above
equation can be evaluated in terms of the probabilities of two distinct events.
The first event is that the i:th version of software $V_i$ is reliable, while
the second event is that all of its previous versions $V_m$ (m = 1, 2, ..., i-1)
fail given that $V_i$ is reliable.

The probability of the first event, $\Pr\{E_i(t)\}$, is straightforward. It can be
calculated by multiplying the reliability functions of all the components
whose failure makes the i:th version $V_i$ fail:

$$ \Pr\{E_i(t)\} = \prod_{V_i \in M_{j,k}} R_{j,k}(t) \tag{9.45} $$

where $V_i \in M_{j,k}$ means that the i:th version of software $V_i$ will fail if
the component $M_{j,k}$ fails.

The probability of the second event,
$\Pr\{\bar{E}_1(t), \bar{E}_2(t), \ldots, \bar{E}_{i-1}(t) \mid E_i(t)\}$, is not as
straightforward to compute. It can be done in the following steps:

Step 1: select all those components that can make any version(s) among
$V_1, V_2, \ldots, V_{i-1}$ fail while $V_i$ is still reliable.

Step 2: use a binary search tree (Johnsonbaugh, 2001) to find all the
exclusive combinations, among the components selected in Step 1, that make all
the i-1 versions $V_1, V_2, \ldots, V_{i-1}$ fail.

Step 3: add up the probabilities of those exclusive combinations to obtain
$\Pr\{\bar{E}_1(t), \bar{E}_2(t), \ldots, \bar{E}_{i-1}(t) \mid E_i(t)\}$.

After computing $\Pr\{\bar{E}_1(t), \bar{E}_2(t), \ldots, \bar{E}_{i-1}(t) \mid E_i(t)\}$
and $\Pr\{E_i(t)\}$, i = 1, 2, ..., N, we can obtain the software reliability
function for dependent N-version programming by substituting them into
Eq. (9.44). An example of aircraft landing is illustrated below.



Example 9.6. Suppose that three teams will compose three versions of a 
program to control the aircraft landing. If any one version is working, the 
aircraft can land successfully. These three versions may depend on each other 
through certain common cause failures. Those failures may occur on the 
common parts of some versions, such as using the same external electrical 
power, integrating the same software packages, sharing identical subroutines 
and so on.

As in the approach presented above, the software is first decomposed
into its individual components. As shown in Example 9.5, the three dependent
versions can be decomposed into 7 components corresponding to different
common cause failures, as shown in Fig. 9.9. Let $R_{j,k}$ denote the reliability
function of $M_{j,k}$. We then have

$$ R(t) = R_{11} R_{21} R_{23} R_{31} + R_{12} R_{21} R_{22} R_{31} (1 - R_{23} R_{11}) + R_{13} R_{22} R_{23} R_{31} \big[ 1 - R_{21} (R_{11} + R_{12} - R_{11} R_{12}) \big] \tag{9.46} $$
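
A closed form such as Eq. (9.46) is easy to mistype, so a brute-force cross-check can be useful. The sketch below compares the three-term expression with a direct enumeration of all $2^7$ component states. It is a verification aid swapped in for the binary-search-tree procedure of the text, and it assumes that the seven components fail independently of one another and that the component-to-version mapping is the one given in Example 9.5. The function names and the demo reliabilities are hypothetical.

```python
from itertools import product

# Versions affected by each component M_{j,k}, following Example 9.5.
COMPONENTS = {
    (1, 1): {1}, (1, 2): {2}, (1, 3): {3},
    (2, 1): {1, 2}, (2, 2): {2, 3}, (2, 3): {1, 3},
    (3, 1): {1, 2, 3},
}


def reliability_closed_form(R):
    """Three-term expression of Eq. (9.46); R maps (j, k) to R_{j,k}."""
    r11, r12, r13 = R[1, 1], R[1, 2], R[1, 3]
    r21, r22, r23 = R[2, 1], R[2, 2], R[2, 3]
    r31 = R[3, 1]
    return (r11 * r21 * r23 * r31
            + r12 * r21 * r22 * r31 * (1 - r23 * r11)
            + r13 * r22 * r23 * r31 * (1 - r21 * (r11 + r12 - r11 * r12)))


def reliability_by_enumeration(R):
    """Sum the probabilities of all component states in which at least one
    version is untouched by any failed component (independence assumed)."""
    keys = list(COMPONENTS)
    total = 0.0
    for state in product([True, False], repeat=len(keys)):
        prob = 1.0
        failed_versions = set()
        for key, works in zip(keys, state):
            prob *= R[key] if works else 1 - R[key]
            if not works:
                failed_versions |= COMPONENTS[key]
        if len(failed_versions) < 3:  # some version still works
            total += prob
    return total


demo_R = {key: 0.5 + 0.06 * n for n, key in enumerate(COMPONENTS)}
print(reliability_closed_form(demo_R), reliability_by_enumeration(demo_R))
```

Enumeration costs $2^{2^N - 1}$ states, so it is only practical for very small N; that growth is the reason a structured procedure is needed in general.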



9.3.2. Optimal testing resource allocation 

An optimization problem for testing-resource allocation can be formulated to
minimize the total cost for the N versions, constrained by a fixed testing-time
budget of T hours. Let $t_i$ be the testing time allocated to the i:th version
$V_i$ (i = 1, 2, ..., N); the total testing time must not exceed T. The
allocation of testing time significantly affects the total cost, which has two
main parts:

(a) Test duration cost $C_t$: the N versions of the software are tested
separately, given their allocated testing times $t_i$ and their expected
costs per unit of testing time $c_i$ (i = 1, 2, ..., N). The test duration
cost can be expressed as

$$ C_t = \sum_{i=1}^{N} c_i t_i \tag{9.47} $$

where $c_i t_i$ is the expected cost of testing the i:th version.

(b) Risk cost $C_r$: this is the cost incurred by an unreliable system, see e.g.
Pham and Zhang (1999). It can be expressed as

$$ C_r = d\,(1 - R) \tag{9.48} $$

where d is the expected cost if the system fails and 1 - R is the
probability that the system fails.

The total cost is the summation of the above two parts. 

Denote by $t_{j,k}$ the testing time for component $M_{j,k}$. During the testing
period, the component $M_{j,k}$ continues running and risks failure unless all
the versions related to $M_{j,k}$ fail. Hence, the testing time $t_{j,k}$ can be
calculated by

$$ t_{j,k} = \max_{m \in M_{j,k}} (t_m) \tag{9.49} $$

where $m \in M_{j,k}$ means that version m is related to component $M_{j,k}$.
Hence, the reliability function of the component $M_{j,k}$ can be written as
$R_{j,k}(x \mid t_{j,k})$, where x is the operation time after the test. The
software reliability function $R(x \mid t)$ can then be derived through the
approach presented above, where $t = (t_1, t_2, \ldots, t_N)$. The optimization
problem of minimizing the total cost by finding a set of testing time
allocations t can be formulated as

$$ \text{Minimize} \quad C(t) = C_t + C_r = \sum_{i=1}^{N} c_i t_i + d\,[1 - R(x \mid t)] \tag{9.50} $$

$$ \text{Subject to} \quad \sum_{i=1}^{N} t_i \le T \tag{9.51} $$

$$ t_i \ge 0 \quad (i = 1, 2, \ldots, N) \tag{9.52} $$

Solving this problem is also difficult, so heuristic algorithms need to be
used. An example in which a genetic algorithm is used to solve it is given
below.

Example 9.7. Continuing with Example 9.6 (the aircraft landing example),
suppose that the testing resource budget is 2000 hours of testing time, i.e.,
T = 2000, that the testing costs per hour for the three versions are
$c_1 = 0.3$, $c_2 = 0.2$, $c_3 = 0.28$, and that the risk cost is d = 10000 if the
aircraft cannot land successfully. The allocation problem is how to allocate
the 2000 hours among the three versions so as to minimize the total cost.

We assume that the common cause failures arriving on each component
follow the Goel-Okumoto (GO) model. That is, the failure rate function of
component $M_{j,k}$ (j = 1, 2, 3 and k = 1, 2, ..., $K_{3,j}$) is modeled by

$$ \lambda_{j,k}(t) = a_{j,k} b_{j,k} \exp(-b_{j,k} t) \tag{9.53} $$

If the testing is stopped after t units of time, the reliability for a mission
of duration x is given by (Yang & Xie, 2000)

$$ R_{j,k}(x \mid t) = \exp\{-\lambda_{j,k}(t)\,x\} \tag{9.54} $$

The values of the parameters $a_{j,k}$ and $b_{j,k}$ in the GO model are given in
Table 9.4 for this example.



Table 9.4. Parameters of the GO-model for each component.

Component   $M_{1,1}$   $M_{1,2}$   $M_{1,3}$   $M_{2,1}$   $M_{2,2}$   $M_{2,3}$   $M_{3,1}$
$a_{j,k}$   16.91       95.52       21.56       15.80       22.45       26.23       6.25
$b_{j,k}$   0.0059      0.0006      0.0041      0.0028      0.0021      0.0022      0.0056



Then, the reliability of the dependent three-version software can be obtained
through Eq. (9.46). Substitute the parameters of Table 9.4 into Eq. (9.54) to
compute the reliability functions of all the components, and then substitute
them into Eq. (9.46) to compute the software reliability, assuming x = 5 (i.e.
it takes 5 hours for the aircraft to land).

To solve the optimization problem of Eqs. (9.50)-(9.52), a genetic algorithm
is used to get the solution $t^* = (638.2, 1361.8, 0)$. The best allocation of
the 2000 hours is to test the first version for 638.2 hours, the second for
1361.8 hours and the third for 0 hours. The total expected cost is
$C(t^*) = 579.48$ and the software reliability is $R(5 \mid t^*) = 0.988434$.
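
The objective of Eq. (9.50) can be evaluated directly for any candidate allocation, which is the fitness computation a heuristic search would repeat many times. The sketch below is an illustration under stated assumptions: it takes the parameters of Table 9.4 in the component order $M_{1,1}, M_{1,2}, M_{1,3}, M_{2,1}, M_{2,2}, M_{2,3}, M_{3,1}$, the component-to-version mapping of Example 9.5, and the three-term expression of Eq. (9.46). The function names are hypothetical. Evaluated at the reported allocation, it should come out close to the cost and reliability quoted in the example.

```python
import math

# GO-model parameters per component (Table 9.4) and affected versions (Example 9.5).
A = {(1, 1): 16.91, (1, 2): 95.52, (1, 3): 21.56,
     (2, 1): 15.80, (2, 2): 22.45, (2, 3): 26.23, (3, 1): 6.25}
B = {(1, 1): 0.0059, (1, 2): 0.0006, (1, 3): 0.0041,
     (2, 1): 0.0028, (2, 2): 0.0021, (2, 3): 0.0022, (3, 1): 0.0056}
VERSIONS = {(1, 1): (1,), (1, 2): (2,), (1, 3): (3,),
            (2, 1): (1, 2), (2, 2): (2, 3), (2, 3): (1, 3), (3, 1): (1, 2, 3)}


def component_reliability(t, x):
    """Eqs. (9.49), (9.53), (9.54) for every component; t is (t_1, t_2, t_3)."""
    R = {}
    for key, versions in VERSIONS.items():
        t_comp = max(t[v - 1] for v in versions)             # Eq. (9.49)
        rate = A[key] * B[key] * math.exp(-B[key] * t_comp)  # Eq. (9.53)
        R[key] = math.exp(-rate * x)                         # Eq. (9.54)
    return R


def system_reliability(R):
    """Three-term expression of Eq. (9.46)."""
    r11, r12, r13 = R[1, 1], R[1, 2], R[1, 3]
    r21, r22, r23 = R[2, 1], R[2, 2], R[2, 3]
    r31 = R[3, 1]
    return (r11 * r21 * r23 * r31
            + r12 * r21 * r22 * r31 * (1 - r23 * r11)
            + r13 * r22 * r23 * r31 * (1 - r21 * (r11 + r12 - r11 * r12)))


def total_cost(t, x=5.0, c=(0.3, 0.2, 0.28), d=10000.0):
    """Eq. (9.50): test duration cost plus risk cost; also returns R(x | t)."""
    rel = system_reliability(component_reliability(t, x))
    return sum(ci * ti for ci, ti in zip(c, t)) + d * (1.0 - rel), rel


cost, rel = total_cost((638.2, 1361.8, 0.0))
print(round(cost, 2), round(rel, 6))
```

Because `total_cost` is cheap to evaluate, it can be plugged into a genetic algorithm or any other search over allocations that satisfy constraints (9.51) and (9.52).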



9.4. Optimal Design of the Grid Architecture 

9.4.1. Grid architecture design 

For grid computing systems (see Chapter 7), the network architecture is an
important factor. Although the physical network may already exist when the
grid is built, constructing a direct link between two remote nodes can still
incur a high cost, where a direct link means that both nodes have the right
to use the shared resources of each other. Hence, the cost of a direct link is
mainly caused by preparing the resources, purchasing the right to use the
resources, or dealing with security problems during communication. Such
cost is called link cost. Here a link can be a virtual link through the
Internet/Intranet, or even a wireless one.

On the other hand, if the grid computing system cannot complete the given
tasks successfully (such as providing services), another kind of cost, called
risk cost (Pham & Zhang, 1999), is caused by the unreliable computing. Hence,
the total expected cost of designing the grid network architecture should
include both link cost and risk cost.

As shown in Fig. 9.10, denote by $PN_i$ the set of programs executed by node i
and by $RN_i$ the set of resources prepared in node i. Suppose that the grid
architecture (i.e. the network of links among the nodes) needs to be designed
given $PN_i$ and $RN_i$ (i = 1, 2, ..., N).



Fig. 9.10. Grid architecture design.



Adding more links among the nodes increases the link cost, but it can
improve the system reliability and thereby reduce the risk cost. To minimize
the total cost, it is therefore important to decide how to design the network
architecture of the grid, i.e. which links should exist.



9.4.2. Optimization model 

Denote by $\sigma_{ij}$ the link between two nodes i and j ($i \ne j$). If
$\sigma_{ij} = 1$, there exists a link between the two nodes, and if
$\sigma_{ij} = 0$, there is no link between them. Here, $\sigma$ is defined as the
vector $\{\sigma_{ij} \mid i, j \in [1, N],\ j > i\}$, which corresponds to a network
architecture of the N-node system. The length of $\sigma$ is N(N-1)/2, because
the maximal number of links needed to completely connect all the N nodes is
N(N-1)/2. Thus, the total link cost can be expressed as

$$ D(\sigma) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \sigma_{ij} C_{ij} \tag{9.55} $$

in which $C_{ij}$ is the cost to construct the link between nodes i and j.

The grid system reliability given a network architecture $\sigma$, denoted by
$GSR(\sigma)$, can be derived from the algorithms presented in Chapter 7. We use
a linear function for the risk cost,

$$ C_r(\sigma) = C_R [1 - GSR(\sigma)] \tag{9.56} $$

in which $1 - GSR(\sigma)$ is the probability of a failed task of the grid and
$C_R$ is a constant which can be interpreted as the expected risk cost if the
task fails.

The total expected cost is the sum of the link cost and the risk cost:

$$ C(\sigma) = D(\sigma) + C_r(\sigma) \tag{9.57} $$

Our objective is to find an optimal network architecture $\sigma^*$ that
minimizes the total expected cost. The optimization model is

$$ \text{Minimize} \quad C(\sigma) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \sigma_{ij} C_{ij} + C_R [1 - GSR(\sigma)] \tag{9.58} $$

$$ \text{Subject to} \quad \sigma_{ij} = 0 \text{ or } 1, \quad i, j \in [1, N],\ j > i \tag{9.59} $$

If there are N nodes in the grid, the length of $\sigma$ is N(N-1)/2 and the
number of possible network architectures is $2^{N(N-1)/2}$. Since this number
grows exponentially with the number of possible links, it is difficult to
search exhaustively for the optimal solution in a complex grid with many
nodes.

Fortunately, for such complex grids, it is usually sufficient to find a good
enough network design through certain heuristic algorithms, although they may
not guarantee the optimum solution.
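
For a handful of nodes the model (9.58)-(9.59) can still be solved by exhaustive enumeration, which also makes the exponential growth tangible. The sketch below is a minimal illustration: the reliability function is passed in as a callable, and `toy_gsr` is a purely hypothetical stand-in (0.95 if the chosen links connect all nodes, 0.2 otherwise), not the GSR algorithm of Chapter 7. Nodes are numbered from 0, and all names and cost values are assumptions.

```python
from itertools import combinations, product


def best_architecture(n_nodes, link_cost, risk_cost, gsr):
    """Enumerate all 2^(N(N-1)/2) link vectors and return the cheapest one
    under C(sigma) = link cost + risk_cost * (1 - GSR(sigma))."""
    links = list(combinations(range(n_nodes), 2))
    best_sigma, best_cost = None, float("inf")
    for bits in product((0, 1), repeat=len(links)):
        sigma = dict(zip(links, bits))
        cost = sum(link_cost[l] * s for l, s in sigma.items())
        cost += risk_cost * (1.0 - gsr(sigma))
        if cost < best_cost:
            best_sigma, best_cost = sigma, cost
    return best_sigma, best_cost


def toy_gsr(sigma, n_nodes=3):
    """Hypothetical stand-in for GSR: high if the links connect all nodes."""
    reached, frontier = {0}, [0]
    while frontier:
        u = frontier.pop()
        for (i, j), s in sigma.items():
            if s and u in (i, j):
                v = j if u == i else i
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
    return 0.95 if len(reached) == n_nodes else 0.2


link_cost = {(0, 1): 10, (0, 2): 12, (1, 2): 30}  # illustrative C_ij values
best_sigma, best_cost = best_architecture(3, link_cost, 1000.0, toy_gsr)
print(best_sigma, best_cost)
```

With 10 nodes there are already $2^{45}$ candidate vectors, so beyond toy sizes the enumeration loop would be replaced by a heuristic search that calls the same cost evaluation.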



9.5. Optimal Integration of the Grid Services 

9.5.1. Grid service integration problem 

After the grid system is built, new services and resources can be further
integrated into the grid by various virtual organizations. This is the
objective of the second generation of the grid, such as the Open Grid Service
Architecture presented by Foster et al. (2001).

A grid service completes certain programs by using some resources
distributed in the grid, as mentioned in Chapter 7. Hence, integrating a
new service into the grid means allocating the programs and resources used by
the service to certain reachable nodes of the grid. The reachable nodes are
those nodes that can be reached and used to upload the programs or integrate
the resources of the new grid service.

In Chapter 7, we have analyzed the grid service reliability which is a 
special type of the grid system reliability by considering the programs of the 
given service in the grid. Recall that the grid service reliability is the 
probability that all the programs of a given service are achieved successfully 
under the grid computing environment. Maximizing it will serve as the 
objective of the service integration problem in this section. 

The problem here is how to optimally allocate/integrate those programs
and resources on the reachable nodes in order to maximize the service
reliability after the integration. The organization needs to decide how many
program/resource redundancies to prepare under the budget limitation, and how
to distribute them among the reachable nodes of the grid system, whose
physical architecture has already been constructed.

Suppose that a grid service is required to complete M programs
$P_1, P_2, \ldots, P_m, \ldots, P_M$, which require access to H resources,
$R_1, R_2, \ldots, R_h, \ldots, R_H$. These programs and resources are viewed as
the components of the grid service. Denote by $C_i$ (i = 1, 2, ..., M+H) the
i:th component of the service, where the first M components correspond to the
M programs and the remaining H components represent the H resources. The
organization can prepare redundancies for each component (program or
resource), but there is a budget constraint on the total cost, denoted by B.
Moreover, the number of redundancies of the i:th component should be no less
than one and no more than an upper bound denoted by $U_i$ (i = 1, 2, ..., K),
where K = M + H (the total number of programs and resources in a grid
service).

The organization can integrate the K components on N reachable nodes
($G_1, G_2, \ldots, G_N$) of the grid. Each node may have a limit on the
components it can integrate, such as a maximal number of components, denoted
by $\beta_i$ (i = 1, 2, ..., N).

Also, some components may already be fixed on specific nodes, while other
components may be allowed on some nodes and not allowed on others. Hence, we
use $a_{ij}$ to describe the relationship between the i:th node ($G_i$) and the
j:th component ($C_j$), i = 1, 2, ..., N and j = 1, 2, ..., K. Here $a_{ij}$ has
three possible values (d, 0 and 1): if $a_{ij} = d$, the organization can freely
choose whether or not to integrate $C_j$ on $G_i$; if $a_{ij} = 0$, $C_j$ is not
allowed to be integrated on $G_i$; and if $a_{ij} = 1$, $C_j$ is fixed on $G_i$.
That is, a is defined as the vector $\{a_{ij} \mid i \in [1, N],\ j \in [1, K]\}$. The
values of a can be determined from the relationship between the nodes and the
service components.

Based on these conditions, the organization has to prepare the redundancies
of each component and distribute them on the reachable nodes of the grid. To
maximize the grid service reliability, the next section presents an
optimization model for integrating a new service on the grid.



9.5.2. An optimization model 

Let $\sigma_{ij}$ represent the integration of the j:th component ($C_j$) on the
i:th node ($G_i$): $\sigma_{ij} = 0$ means that $C_j$ is not integrated with $G_i$
and $\sigma_{ij} = 1$ means that $C_j$ is integrated with $G_i$. Let $\sigma$ be
defined as the vector $\{\sigma_{ij} \mid i \in [1, N],\ j \in [1, K]\}$, which represents
an integration schedule of the grid service components on the nodes. Hence,
given the structure of the grid, the grid service reliability is determined
only by the integration scheduling vector $\sigma$. The grid service reliability
is then expressed by $R_s(\sigma)$, which can be computed by the algorithms in
Section 7.3. Thus, the optimization problem is to find the $\sigma$ that
maximizes the integrated grid service reliability. The optimization model is
described as follows:



$$ \text{Decision variables:} \quad \sigma = \{\sigma_{ij} = 0, 1 \mid i \in [1, N],\ j \in [1, K]\} \tag{9.60} $$

$$ \text{Objective function:} \quad \text{Maximize } R_s(\sigma) \tag{9.61} $$

Constraints:

$$ \sigma_{ij} = a_{ij} \ (\text{where } a_{ij} \ne d), \quad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, K \tag{9.62} $$

$$ 1 \le \sum_{i=1}^{N} \sigma_{ij} \le U_j, \quad j = 1, 2, \ldots, K \tag{9.63} $$

$$ \sum_{j=1}^{K} \sigma_{ij} \le \beta_i, \quad i = 1, 2, \ldots, N \tag{9.64} $$

$$ \sum_{i=1}^{N} \sum_{j=1}^{K} c_j \sigma_{ij} \le B \tag{9.65} $$

where the first constraint (9.62) is imposed by a, the relationship between
nodes and components:

if $a_{ij} = 1$ (i.e. the component $C_j$ has to be fixed on $G_i$), the value
of $\sigma_{ij}$ has to be set to 1; and

if $a_{ij} = 0$ (i.e. the component $C_j$ cannot be integrated on $G_i$), the
value of $\sigma_{ij}$ has to be set to 0.

In the second constraint, Eq. (9.63), $\sum_{i=1}^{N} \sigma_{ij}$ represents the
total number of redundancies of component $C_j$ integrated in the grid, which
should be between 1 and its upper bound $U_j$. In the third constraint,
Eq. (9.64), $\sum_{j=1}^{K} \sigma_{ij}$ represents the total number of components
integrated on the node $G_i$, which ought to be no more than its upper bound
$\beta_i$. Finally, in Eq. (9.65), $c_j$ is the cost to prepare a redundancy of
the j:th component, so the left-hand side represents the total cost of the
integration of the grid service, which has to be no more than the budget B.

In order to solve this optimization problem, heuristic algorithms can be 
used. 
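
Whatever heuristic is chosen, it needs a way to reject infeasible integration schedules. The sketch below checks a candidate $\sigma$ against constraints (9.62)-(9.65). It is an illustrative helper, not part of the model in the text: the function name, the list-of-lists representation of $\sigma$ and a, the use of the string "d" for a free choice, and the demo data are all assumptions.

```python
def is_feasible(sigma, a, upper, beta, cost, budget):
    """Check constraints (9.62)-(9.65).
    sigma[i][j] in {0, 1}: component j integrated on node i or not.
    a[i][j] in {0, 1, "d"}: forbidden, fixed, or free choice."""
    n, k = len(sigma), len(sigma[0])
    for i in range(n):
        for j in range(k):
            if a[i][j] != "d" and sigma[i][j] != a[i][j]:    # (9.62)
                return False
    for j in range(k):
        copies = sum(sigma[i][j] for i in range(n))
        if not 1 <= copies <= upper[j]:                      # (9.63)
            return False
    for i in range(n):
        if sum(sigma[i]) > beta[i]:                          # (9.64)
            return False
    total = sum(cost[j] * sigma[i][j] for i in range(n) for j in range(k))
    return total <= budget                                   # (9.65)


# Two nodes, two components: component 2 is fixed on node 1 and barred from node 2.
a = [["d", 1], ["d", 0]]
ok = is_feasible([[1, 1], [1, 0]], a, upper=[2, 1], beta=[2, 1],
                 cost=[3.0, 5.0], budget=11.0)
bad = is_feasible([[1, 0], [0, 1]], a, upper=[2, 1], beta=[2, 1],
                  cost=[3.0, 5.0], budget=11.0)
print(ok, bad)
```

A search procedure can call such a check before evaluating $R_s(\sigma)$, so that the more expensive reliability computation is spent only on feasible schedules.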



9.6. Notes and References 

For the optimal number of redundant units, many other studies have also been 
presented. Pham (1992) determined the optimal number of spare units that 
minimize the average total system cost. Imaizumi et al. (2000) obtained the
mean time and the expected cost until system failure and discussed an optimal 
number which minimizes the expected cost for a system with multiple 
microprocessor units. Hsieh & Hsieh (2003) developed a relationship between 
system cost and hardware redundancy levels, and presented an optimization 
model aiming at minimizing the total system cost. Hsieh (2003) further
presented optimization models for the policies of task allocation and hardware
redundancy of distributed computing systems. Chang et al. (2003) further
presented an optimization model for dynamically adding and removing
redundant units of the computing system.

The optimal testing-resource allocation problem has also been studied
extensively in the literature. Yamada & Nishiwaki (1995) proposed optimal
allocation policies for testing-resource based on software reliability growth
models. Based on the hyper-geometric distribution software reliability growth
model, Hou et al. (1996) investigated two optimal resource allocation problems 
in software module testing. Leung (1997) later studied the dynamic 
resource-allocation for software-module testing. Coit (1998) presented a 
method to allocate subsystem reliability growth test time in order to maximize 
the system reliability when designers are confronted with limited testing 
resources. Lyu et al. (2002) further considered software component testing 
resource allocation for a system with single or multiple applications. For the 
networked system, Hsieh & Lin (2003) aimed to determine the optimal 
resource allocation policy at source nodes subject to given resource demands at 
sink nodes such that the network reliability of the stochastic-flow network is 
maximized. 

    For grid computing system design, Buyya et al. (2002) presented two
optimization strategies for providing grid services based on economic
models of resource management. Furmento et al. (2002) used composite
performance models to optimally combine currently available components into
the network of the grid environment. According to the QoS requirements,
Dogan & Ozguner (2002) presented an optimization model for scheduling
independent tasks in grid computing with time-varying resource prices.

    Optimization models have been widely studied in other areas of
computing systems. Okumoto & Goel (1980) first discussed the optimal
software release policy from the cost-benefit viewpoint. There are many
follow-up papers, and a chapter in Xie (1991) is devoted to this issue. Zheng
(2002) considered some dynamic release policies. The sensitivity of software
release time remains an issue, and some preliminary discussion can be found in
Xie & Hong (1998). Quigley & Walls (2003) discussed confidence intervals
for reliability-growth models when the sample size is small, which is a common
situation.

    Ashrafi et al. (1994) discussed the optimal design of N-version
programming systems. Berman & Kumar (1999) considered some optimization
models for recovery block design. Jung & Choi (1999) studied some 
optimization models for modular systems based on cost analysis. 

    Tom & Murthy (1999) implemented graph matching and state space search
techniques to optimize the schedule of task allocation in distributed
computing systems. Karatza (2001) investigated optimal scheduling policies in
a heterogeneous distributed system, where half of the processors have
double the speed of the others. Kuo & Prasad (2000) reviewed some
system-reliability optimization models. Kuo & Zuo (2003) more recently
summarized many reliability optimization models for computing systems.